Salesforce Spark delimiter issue - pyspark

I have a Glue job in which I am reading a table from Salesforce using SOQL:
df = (
    spark.read.format("com.springml.spark.salesforce")
    .option("soql", sql)
    .option("queryAll", "true")
    .option("sfObject", sf_table)
    .option("bulk", bulk)
    .option("pkChunking", pkChunking)
    .option("version", "51.0")
    .option("timeout", "99999999")
    .option("username", login)
    .option("password", password)
    .load()
)
Whenever a string value contains both double quotes and a comma, it messes up my table schema, like so:
in source:

Column A | Column B           | Column C
000AB    | "text with, comma" | 123XX

read from SF in df:

Column A | Column B    | Column C
000AB    | "text with  | comma"
Is there any option to stop this comma from being treated as a delimiter? I tried various options but nothing worked. And SOQL doesn't accept REPLACE or SUBSTRING functions; its text-manipulation functions are, well, basically nonexistent.

All the information I'm giving needs to be tested. I do not have the same environment, so it is difficult for me to try anything, but here is what I found.
When you check the official doc, you find that there is a field metadataConfig. The documentation for this field can be found here: https://resources.docs.salesforce.com/sfdc/pdf/bi_dev_guide_ext_data_format.pdf
On page 2, in the CSV format section, it says:
If a field value contains a control character or a new line the field value must be contained within double quotes (or your
fieldsEscapedBy value). The default control characters (fieldsDelimitedBy, fieldsEnclosedBy,
fieldsEscapedBy, or linesTerminatedBy) are comma and double quote. For example, "Director of
Operations, Western Region".
which sounds a lot like your current problem.
By default, those values are a comma and a double quote, so I do not understand why it is failing. But apparently your output keeps the double quotes, so maybe it only considers single quotes.
You should try to enforce the format and add this to your code:
.option("metadataConfig", '{"fieldsEnclosedBy": "\"", "fieldsDelimitedBy": ","}')
# Or something similar - i could'nt test, so you need to try by yourself
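For context, here is how that option might slot into the original read, purely as an untested sketch: building the JSON with json.dumps sidesteps the quote-escaping pitfall, and whether the connector honours metadataConfig for SOQL reads (rather than only for Analytics/Wave loads) is an assumption on my part.

import json

# Assumption: the connector accepts a metadataConfig JSON string mirroring the
# external-data format doc (fieldsEnclosedBy / fieldsDelimitedBy).
metadata_config = json.dumps({"fieldsEnclosedBy": '"', "fieldsDelimitedBy": ","})

df = (
    spark.read.format("com.springml.spark.salesforce")
    .option("soql", sql)
    .option("sfObject", sf_table)
    .option("metadataConfig", metadata_config)
    .option("username", login)
    .option("password", password)
    .load()
)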

Related

Convert comma-separated non-JSON string to JSON

Below is the value of a string in a text column.
select col1 from tt_d_tab;
'A:10000000,B:50000000,C:1000000,D:10000000,E:10000000'
I'm trying to convert it into JSON of the format below:
'{"A": 10000000,"B": 50000000,"C": 1000000,"D": 10000000,"E": 10000000}'
Can someone help with this?
If you know that neither the keys nor values will have : or , characters in them, you can write
select json_object(regexp_split_to_array(col1,'[:,]')) from tt_d_tab;
This splits the string on every colon and comma, then interprets the result as key/value pairs.
If the string manipulation gets any more complicated, SQL may not be the ideal tool for the job, but it's still doable, either by this method or by converting the string into the form you need directly and then casting it to json with ::json.
If your key is a single capital letter, as in your example:
select concat('{',regexp_replace('A:10000000,B:50000000,C:1000000,D:10000000,E:10000000','([A-Z])','"\1"','g'),'}')::json json_field;
A more general case, with any number of letters, caps or not:
select concat('{',regexp_replace('Ac:10000000,BT:50000000,Cs:1000000,D:10000000,E:10000000','([a-zA-Z]+)','"\1"','g'),'}')::json json_field;
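For intuition, the same split-then-pair logic can also be done client-side; this Python sketch is purely illustrative and not part of the SQL answer:

import json
import re

s = 'A:10000000,B:50000000,C:1000000,D:10000000,E:10000000'

# Split on every colon and comma, then pair the pieces up as key/value,
# mirroring json_object(regexp_split_to_array(col1, '[:,]')).
parts = re.split('[:,]', s)
obj = {k: int(v) for k, v in zip(parts[::2], parts[1::2])}

print(json.dumps(obj))  # {"A": 10000000, "B": 50000000, "C": 1000000, "D": 10000000, "E": 10000000}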

Pyspark: Reading CSV files with fields having double quotes and commas

I have a CSV file which I am reading through pyspark and loading into PostgreSQL. One of its fields has strings that contain commas and double quotes within the string, like the examples below:
1. "RACER ""K"", P.L. 9"
2. "JENIS, B. S. ""N"" JENIS, F. T. ""B"" 5"
Pyspark is parsing it as below, which causes an issue because it mixes up the values/columns when I load the data into PostgreSQL and the script fails.
1. '\"RACER \"\"K\"\"'
2. '\"JENIS, B. S. \"\"N\"\" JENIS'
I am using Spark 2.4.2. How can this situation be handled in pyspark?
Basically, I want the program to ignore commas or double quotes when they appear inside double quotes.
You can try removing the commas and double quotes using pandas before reading and loading into PostgreSQL.
You can use str.replace:
df['column_name'] = df['column_name'].str.replace(r"[\"\',]", '')
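A minimal sketch of that pandas pre-cleaning step before handing the data to Spark; the file path, column name, and JDBC details are placeholders, and an existing SparkSession named spark is assumed:

import pandas as pd

# pandas' CSV reader copes with the doubled quotes on its own, so the awkward
# characters can be stripped after parsing rather than during it.
pdf = pd.read_csv("input.csv")  # placeholder path
pdf["column_name"] = pdf["column_name"].str.replace(r"[\"',]", "", regex=True)

sdf = spark.createDataFrame(pdf)
sdf.write.jdbc(url=jdbc_url, table="target_table", mode="append", properties=jdbc_props)  # placeholder JDBC settings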

Remove quotes from Strings in ClickHouse while exporting

I'm trying to export data to CSV from the ClickHouse CLI.
I have a field which is a String, and when exported to CSV this field has quotes around it.
I want to export without the quotes but couldn't find any setting that can be set.
I went through https://clickhouse.yandex/docs/en/interfaces/formats but the Values section mentions
Strings, dates, and dates with times are output in quotes
While for JSON they have a flag that is to be set for removing quotes around Int64 and UInt64
For compatibility with JavaScript, Int64 and UInt64 integers are enclosed in double quotes by default. To remove the quotes, you can set the configuration parameter output_format_json_quote_64bit_integers to 0.
I was wondering if there is such kind of flag for strings in CSV as well.
I'm exporting using the below command
clickhouse client --multiquery --host="localhost" --port="9000" --query="SELECT field1, field2 from tableName format CSV" > /data/content.csv
I want to try removing the quotes from the shell only as a last resort if nothing else works.
Any help on the way I can remove the quotes while the CSV is generated would be appreciated.
Nope, there isn't. However, you can easily achieve this with arrayStringConcat.
SELECT arrayStringConcat([toString(field1), toString(field2)], ',') from tableName format TSV;
Edit
In order to output Nullable fields as an empty string, you might need the if function.
if(isNull(field1), '', assumeNotNull(field1))
This works for any type, while assumeNotNull alone only works for String.

Vertica COPY + FLEX table

I want to load into a flex table a log in which each record is composed of some fields plus a JSON document; the format is the following:
"concorde-fe";"DETAILS.SHOWN";"1bcilyejs6d4w";"2017-01-31T00:00:04.801Z";"2017-01-31T00:00:04.714Z";"{"requestedFrom":"BUTTON","tripId":{"request":3003926837969,"mac":"v01162450701"}}"
and (after many tries) I'm using the COPY command with a CSV parser in this way:
COPY schema.flex_table from local 'C:\temp/test.log' parser fcsvparser(delimiter=';',header=false, trim=true, type='traditional')
This way everything is loaded correctly except the JSON, which is skipped and left empty.
Is there a way to load also the JSON as a string?
HINT: just for test purposes, I noticed that if, in the JSON, I put a '\' before every '"' in the log, the loading runs smoothly, but unfortunately I cannot modify the content of the log.
Not without modifying the file beforehand - or writing your own UDParser function.
It clearly is a strange format: CSV (well, semicolon delimited and with double quotes as string enclosers), until the children appear - which are stored with a leading double quote and a trailing double quote; doubly nested with curly braces - JSON type, ok. But you have double quotes (not doubled) within the JSON encoding - any parser would go astray on those.
You'll have to write a program (ideally in C) to remove the curly braces and the column names in the JSON code, and leave just a CSV line; a rough sketch follows after the example below.
So, from the line (the backslash at the end means an escaped newline, meaning that the three lines you see are actually one line, for readability)
"concorde-fe";"DETAILS.SHOWN";"1bcilyejs6d4w";"2017-01-31T00:00:04.801Z"; \
"2017-01-31T00:00:04.714Z"; \
"{"requestedFrom":"BUTTON","tripId":{"request":3003926837969,"mac":"v01162450701"}}"
you make (title line with column names, then data line)
col1;col2;col3;timestampz1;\
timestampz2;requestedfrom;tripid_request;tripid_mac
"concorde-fe";"DETAILS.SHOWN";"1bcilyejs6d4w";"2017-01-31T00:00:04.801Z"; \
"2017-01-31T00:00:04.714Z";"BUTTON";3003926837969;"v01162450701"
Finally, you'll be able to load it as a CSV file - and maybe you will then have to normalise everything again: tripId seems to be a dependent structure ...
Good luck
Marco the Sane

How to escape column names with hyphen in Spark SQL

I have imported a JSON file in Spark and converted it into a table as
myDF.registerTempTable("myDF")
I then want to run SQL queries on this resulting table
val newTable = sqlContext.sql("select column-1 from myDF")
However this gives me an error because of the hyphen in the column name column-1. How do I resolve this in Spark SQL?
Backticks (`) appear to work, so
val newTable = sqlContext.sql("select `column-1` from myDF")
should do the trick, at least in Spark v1.3.x.
I was at it for a bit yesterday; it turns out there is a way to escape the (:) and the (.).
Only the field containing (:) needs to be escaped with backticks:
sqlc.select("select `sn2:AnyAddRq`.AnyInfo.noInfo.someRef.myInfo.someData.Name AS sn2_AnyAddRq_AnyInfo_noInfo_someRef_myInfo_someData_Name from masterTable").show()
I cannot comment as I have less than 50 reps
When you are referencing a JSON structure with struct.struct.field and there is a namespace present, like ns2:struct.struct.field, the backticks (`) do not work.
jsonDF = sqlc.read.load('jsonMsgs', format="json")
jsonDF.registerTempTable("masterTable")
sqlc.select("select `sn2:AnyAddRq.AnyInfo.noInfo.someRef.myInfo.someData.Name` AS sn2_AnyAddRq_AnyInfo_noInfo_someRef_myInfo_someData_Name from masterTable").show()
pyspark.sql.utils.AnalysisException: u"cannot resolve 'sn2:AnyAddRq.AnyInfo.noInfo.someRef.myInfo.someData.Name'
If I remove the sn2: fields, the query executes.
I have also tried with a single quote ('), backslash (\), and double quotes ("").
The only way it works is if I register another temp table on the sn2: structure; I am then able to access the fields within it like so:
anotherDF = jsonDF.select("sn2:AnyAddRq.AnyInfo.noInfo.someRef.myInfo.someData")
anotherDF.registerTempTable("anotherDF")
sqlc.select("select Name from anotherDF").show()
This is what I do, which also works in Spark 3.x.
I define the function litCols() at the top of my program (or in some global scope):
litCols = lambda seq: ','.join(('`'+x+'`' for x in seq)) # Accepts any sequence of strings.
And then apply it as necessary to prepare my literalized SELECT columns. Here's an example:
>>> UNPROTECTED_COLS = ["RegionName", "StateName", "2012-01", "2012-02"]
>>> LITERALIZED_COLS = litCols(UNPROTECTED_COLS)
>>> print(LITERALIZED_COLS)
`RegionName`,`StateName`,`2012-01`,`2012-02`
The problematic column names in this example are the YYYY-MM columns, which Spark would otherwise resolve as arithmetic expressions (2012 minus 01 and 2012 minus 02), resulting in 2011 and 2010, respectively.
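For completeness, a minimal sketch of how the literalized column list might be used; the table name and the SparkSession are assumptions, not part of the original answer:

# Hypothetical usage of litCols with spark.sql; "home_prices" is a made-up table name.
cols = litCols(["RegionName", "StateName", "2012-01", "2012-02"])
df = spark.sql(f"SELECT {cols} FROM home_prices")  # backticked names are taken literally
df.show()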