pyspark dataframe returning different characters \"\" instead of nulls

pyspark dataframe returning different characters \"\" instead of nulls - pyspark

I was reading a fixed with file from hadoop and doing substr and converting it to delimiter file. Code is working fine but instead of emply values in case of null it is returning \"\". Could you please suggest?
snippet
df.select(
df.value.substr(31, 1).alias('status'),
df.value.substr(32, 1).alias('tin_cert'),
df.value.substr(116, 1).alias('c_notice_flg'),
df.value.substr(117, 2).alias('nbr_non_prime_trlrs'),
df.value.substr(119, 3).alias('aw_related')
).write.option("delimiter", "|").csv(unixFile)
output
|\"\"|0|N|00|\"\"|199|
desired output
||0|N|00||199|
no quotes in the input file
000000000014999999999 281AAAA AAAAAAA AAAA 1NN00
000000000024 200BBBBB BBBBBBBBBBBBBBBBB 0NN00
000000000034 200 0NN00
000000000044 200 0NN00

I think escaped quotes are added because of default arguments for pyspark.sql.DataFrameWriter.csv method. In fact, as you can see from the docs:
quote – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If an empty string is set, it uses u0000 (null character).
escape – sets a single character used for escaping quotes inside an already quoted value. If None is set, it uses the default value, \

Related

How to remove double Quotes In DataStage using a transformer stage?

We receiving Input data like below
“VENKATA,KRISHNA”
I want output like below
VENKATA,KRISHNA
Can anyone help me with this

Check out the Ereplace function - it allows to replace certain characters so you could rplace " with '' (empty string).
An alternative is TRIM - you can specify which character the command should trim and also if All occurrences or Both (from both sides of the string) plus more.

SalesForce Spark Delimiter issue

I have a glue job, in which am reading table from SF using soql:
df = (
spark.read.format("com.springml.spark.salesforce")
.option("soql", sql)
.option("queryAll", "true")
.option("sfObject", sf_table)
.option("bulk", bulk)
.option("pkChunking", pkChunking)
.option("version", "51.0")
.option("timeout", "99999999")
.option("username", login)
.option("password", password)
.load()
)
and whenever there is a combination of double-quotes and commas in the string it messes up my table schema, like so:
in source:
Column A
Column B
Column C
000AB
"text with, comma"
123XX
read from SF in df :
Column A
Column B
Column C
000AB
"text with
comma"
Is there any option to avoid such cases when this comma is treated as a delimiter? I tried various options but nothing worked. And SOQL doesn't accept REPLACE or SUBSTRING functions, their text manipulation functions are, well, basically there aren't any.

All the information I'm giving need to be tested. I do not have the same env so it is difficult for me to try anything but here is what I foud.
When you check the official doc, you find that there is a field metadataConfig. The documentation of this field can be found here : https://resources.docs.salesforce.com/sfdc/pdf/bi_dev_guide_ext_data_format.pdf
On page 2, csv format, it says :
If a field value contains a control character or a new line the field value must be contained within double quotes (or your
fieldsEscapedBy value). The default control characters (fieldsDelimitedBy, fieldsEnclosedBy,
fieldsEscapedBy, or linesTerminatedBy) are comma and double quote. For example, "Director of
Operations, Western Region".
which kinda sounds like you current problem.
By default, the values are comma and double quotes, so, I do not understand why it is failing. But, apparently, in your output, it keeps the double quotes, so, maybe, it considers only simple quote.
You should try to enforce the format and add in you code :
.option("metadataConfig", '{"fieldsEnclosedBy": "\"", "fieldsDelimitedBy": ","}')
# Or something similar - i could'nt test, so you need to try by yourself

db2 remove all non-alphanumeric, including non-printable, and special characters

This may sound like a duplicate, but existing solutions does not work.
I need to remove all non-alphanumerics from a varchar field. I'm using the following but it doesn't work in all cases (it works with diamond questionmark characters):
select TRANSLATE(FIELDNAME, '?',
TRANSLATE(FIELDNAME , '', 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'))
from TABLENAME
What it's doing is the inner translate parse all non-alphanumeric characters, then the outer translate replace them all with a '?'. This seems to work for replacement character�. However, it throws The second, third or fourth argument of the TRANSLATE scalar function is incorrect. which is expected according to IBM:
The TRANSLATE scalar function does not allow replacement of a character by another character which is encoded using a different number of bytes. The second and third arguments of the TRANSLATE scalar function must end with correctly formed characters.
Is there anyway to get around this?
Edit: #Paul Vernon's solution seems to be working:
· 6005308 ??6005308
–6009908 ?6009908
–6011177 ?6011177
��6011183�� ??6011183??

Try regexp_replace(c,'[^\w\d]','') or regexp_replace(c,'[^a-zA-Z\d]','')
E.g.
select regexp_replace(c,'[^a-zA-Z\d]','') from table(values('AB_- C$£abc�$123£')) t(c)
which returns
1
---------
ABCabc123
BTW Note that the allowed regular expression patterns are listed on this page Regular expression control characters
Outside of a set, the following must be preceded with a backslash to be treated as a literal
* ? + [ ( ) { } ^ $ | \ . /
Inside a set, the follow must be preceded with a backslash to be treated as a literal
Characters that must be quoted to be treated as literals are [ ] \
Characters that might need to be quoted, depending on the context are - &

how to remove # character from national data type in cobol

i am facing issue while converting unicode data into national characters.
When i convert the Unicode data into national using national-of function, some junk character like # is appended after the string.
E.g
Ws-unicode pic X(200)
Ws-national pic N(600)
--let the value in Ws-Unicode is これらの変更は. getting from java end.
move function national-of ( Ws-unicode ,1208 ) to Ws-national.
--after converting value is like これらの変更は #.
i do not want the extra # character added after conversion.
please help me to find out the possible solution, i have tried to replace N'#' with space using inspect clause.
it worked well but failed in some specific scenario like if we have # in input from user end. in that case genuine # also converted to space.

Below is a snippet of code I used to convert EBCDIC to UTF. Before I was capturing string lengths, I was also getting # symbols:
STRING
FUNCTION DISPLAY-OF (
FUNCTION NATIONAL-OF (
WS-EBCDIC-STRING(1:WS-XML-EBCDIC-LENGTH)
WS-EBCDIC-CCSID
)
WS-UTF8-CCSID
)
DELIMITED BY SIZE
INTO WS-UTF8-STRING
WITH POINTER WS-XML-UTF8-LENGTH
END-STRING
SUBTRACT 1 FROM WS-XML-UTF8-LENGTH
What this code does is string the UTF8 representation of the EBCIDIC string into another variable. The WITH POINTER clause will capture the new length of the string + 1 (+ 1 because the pointer is positioned to the next position after the string ended).
Using this method, you should be able to know exactly how long second string is and use that string with the exact length.
That should remove the unwanted #s.
EDIT:
One thing I forgot to mention, in my case, the # signs were actually EBCDIC low values when viewing the actual hex on the mainframe

Use inspect with reverse and stop after first occurence of #

How to import empty strings as null values from CSV file - using pgloader?

I am using pgloader to import from a .csv file which has empty strings in double quotes. A sample line is
12334,0,"MAIL","CA","","Sanfransisco","TX","","",""
After a successful import, the fields that has double quotes ("") are shown as two single quotes('') in postgres database.
Is there a way we can insert a null or even empty string in place of two single quotes('')?
I am using the arguments -
WITH truncate,
fields optionally enclosed by '"',
fields escaped by double-quote,
fields terminated by ','
SET client_encoding to 'UTF-8',
work_mem to '12MB',
standard_conforming_strings to 'on'
I tried using 'empty-string-to-null' mentioned in the documentation like this -
CAST column enumerate.fax using empty-string-to-null
But it gives me an error saying -
pgloader nph_opr_addr.test.load An unhandled error condition has been
signalled: At LOAD CSV
^ (Line 1, Column 0, Position 0) Could not parse subexpression ";"
when parsing

Use the field option:
null if blanks
Something like this:
...
having fields foo, bar, mynullcol null if blanks, baz
From the documentation:
null if
This option takes an argument which is either the keyword blanks or a double-quoted string.
When blanks is used and the field value that is read contains only space characters, then it's automatically converted to an SQL NULL value.
When a double-quoted string is used and that string is read as the field value, then the field value is automatically converted to an SQL NULL value

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

pyspark dataframe returning different characters \"\" instead of nulls - pyspark

Related

How to remove double Quotes In DataStage using a transformer stage?

SalesForce Spark Delimiter issue

db2 remove all non-alphanumeric, including non-printable, and special characters

how to remove # character from national data type in cobol

How to import empty strings as null values from CSV file - using pgloader?

Categories

Resources