How to escape column names with hyphen in Spark SQL - scala

I have imported a JSON file in Spark and converted it into a table as
myDF.registerTempTable("myDF")
I then want to run SQL queries on this resulting table
val newTable = sqlContext.sql("select column-1 from myDF")
However, this gives me an error because of the hyphen in the column name column-1. How do I resolve this in Spark SQL?

Backticks (`) appear to work, so
val newTable = sqlContext.sql("select `column-1` from myDF")
should do the trick, at least in Spark v1.3.x.
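For completeness, here is a minimal PySpark sketch of the same idea (the DataFrame and column names are invented for illustration; on Spark 1.x use registerTempTable instead of createOrReplaceTempView):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with a hyphenated column name
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "column-1"])
df.createOrReplaceTempView("myDF")

# Without backticks, Spark parses column-1 as the expression "column - 1" and fails to resolve it;
# backticks make it a single identifier
spark.sql("select `column-1` from myDF").show()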

I was at this for a bit yesterday; it turns out there is a way to escape a (:) and a (.), like so:
Only the field containing the (:) needs to be escaped with backticks:
sqlc.sql("select `sn2:AnyAddRq`.AnyInfo.noInfo.someRef.myInfo.someData.Name AS sn2_AnyAddRq_AnyInfo_noInfo_someRef_myInfo_someData_Name from masterTable").show()

I cannot comment as I have less than 50 rep.
When you are referencing a JSON structure with struct.struct.field and there is a namespace prefix present, like
ns2:struct.struct.field, backticks (`) around the whole path do not work.
jsonDF = sqlc.read.load('jsonMsgs', format="json")
jsonDF.registerTempTable("masterTable")
sqlc.sql("select `sn2:AnyAddRq.AnyInfo.noInfo.someRef.myInfo.someData.Name` AS sn2_AnyAddRq_AnyInfo_noInfo_someRef_myInfo_someData_Name from masterTable").show()
pyspark.sql.utils.AnalysisException: u"cannot resolve 'sn2:AnyAddRq.AnyInfo.noInfo.someRef.myInfo.someData.Name'
If I remove the sn2: fields, the query executes.
I have also tried with single quotes ('), backslashes (\), and double quotes (").
The only way it works is if I register another temp table on the sn2: structure; then I am able to access the fields within it, like so:
anotherDF = jsonDF.select("sn2:AnyAddRq.AnyInfo.noInfo.someRef.myInfo.someData")
anotherDF.registerTempTable("anotherDF")
sqlc.sql("select Name from anotherDF").show()
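Putting that workaround together, a rough PySpark sketch might look like the following (the file path is a placeholder, the field names are taken from the example above, and createOrReplaceTempView is the newer equivalent of registerTempTable):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path to the JSON messages
jsonDF = spark.read.load("jsonMsgs", format="json")
jsonDF.createOrReplaceTempView("masterTable")

# Select the namespaced struct through the DataFrame API (as in the answer above;
# if this still trips on the colon, backtick the `sn2:AnyAddRq` segment as in the earlier SQL example),
# then expose its children under a second temp view
anotherDF = jsonDF.select("sn2:AnyAddRq.AnyInfo.noInfo.someRef.myInfo.someData")
anotherDF.createOrReplaceTempView("anotherDF")

spark.sql("select Name from anotherDF").show()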

This is what I do, which also works in Spark 3.x.
I define the function litCols() at the top of my program (or in some global scope):
litCols = lambda seq: ','.join(('`'+x+'`' for x in seq)) # Accepts any sequence of strings.
And then apply it as necessary to prepare my literalized SELECT columns. Here's an example:
>>> UNPROTECTED_COLS = ["RegionName", "StateName", "2012-01", "2012-02"]
>>> LITERALIZED_COLS = litCols(UNPROTECTED_COLS)
>>> print(LITERALIZED_COLS)
`RegionName`,`StateName`,`2012-01`,`2012-02`
The problematic column names in this example are the YYYY-MM columns, which Spark would otherwise parse as arithmetic expressions (2012-01 and 2012-02), resulting in 2011 and 2010, respectively.
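As a usage sketch (the table name and rows are hypothetical), the literalized string drops straight into a SQL statement:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

litCols = lambda seq: ','.join('`' + x + '`' for x in seq)

# Hypothetical DataFrame whose trailing column names look like dates
df = spark.createDataFrame(
    [("Seattle", "WA", 1.0, 2.0)],
    ["RegionName", "StateName", "2012-01", "2012-02"],
)
df.createOrReplaceTempView("prices")

cols = litCols(["RegionName", "StateName", "2012-01", "2012-02"])
spark.sql("select {} from prices".format(cols)).show()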

Related

Renaming columns with ' in pyspark

How to rename column "RANDY'S" to 'RANDYS' in pyspark?
I tried the code below and it's not working:
test_rename_df=df.withColumnRenamed('"RANDY''S"','RANDYS')
Note that the original column name has double quotes around it.
You're adding too many quotes around the original column name. Try this:
test_rename_df = df.withColumnRenamed("RANDY\'S", "RANDYS")
Side-note
When you call df.columns, the column RANDY'S is surrounded by double quotes instead of single quotes to avoid confusion.
If your column had the name RANDY"S, df.columns would instead use single quotes around the column name.
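A minimal, self-contained sketch of the rename (the sample data is made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with an apostrophe in a column name
df = spark.createDataFrame([(1,), (2,)], ["RANDY'S"])
print(df.columns)       # Python's repr shows it as ["RANDY'S"], with double quotes

renamed = df.withColumnRenamed("RANDY'S", "RANDYS")
print(renamed.columns)  # ['RANDYS']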

How can I use REGEX_REPLACE in pyspark SQL to remove \n and \r from column

I'm trying to read data from ScyllaDB and want to remove \n and \r characters from a column. The problem is that these characters are stored as strings in the column of the table being read, and I need to use REGEX_REPLACE since I'm using Spark SQL for this. The regex patterns that work in MySQL don't seem to work here: the string becomes blank, but the characters aren't removed. Below is a snippet of the query being used in Spark SQL. Help appreciated.
The following string is present in the message column: 'hello\nworld\r'
The expected output is 'hello world'
df=spark.sql("select REGEXP_REPLACE(message,'\n|\r|\r\n',' ') as replaced_message from delivery_sms")
Thanks Andrew for the answer.
The following worked for me:
df.withColumn("test",regexp_replace("_c0","\\\\n|\\\\r"," ")).show()
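For context, a small assumed example (the original data isn't shown) of why the doubled escaping matters when the column holds the literal two-character sequences \n and \r rather than real control characters:
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.getOrCreate()

# "_c0" holds literal backslash-n / backslash-r text, not real newlines
df = spark.createDataFrame([(r"hello\nworld\r",)], ["_c0"])

# "\\\\n" in Python is the regex \\n, which matches a literal backslash followed by n
df.withColumn("test", regexp_replace("_c0", "\\\\n|\\\\r", " ")).show(truncate=False)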
If anybody is looking for a modern solution to this problem:
from pyspark.sql import functions as F

df.select(
    F.translate(F.col("test"), "\n\r", " " * len("\n\r")).alias("test")
)

Getting Python to accept a csv into postgreSQL table with ":" in the headers

I receive a .csv export every 10 minutes that I'd like to import into a PostgreSQL server. Working with a test csv, I got everything to work, but I didn't notice that my actual csv file has a forced ":" at the end of each column header (but not on the first header, for some reason). It is built into the back-end of the exporter, so I can't get it removed (I already asked the company). So I added the ":"s to my test csv as shown in the link.
My INSERT INTO statements no longer work and give me syntax errors. First I'm trying to add the rows using the following code:
print("Reading file contents and copying into table...")
with open('C:\\Users\\admin\\Desktop\\test2.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    columns = next(readCSV)  # skips the header row
    query = 'insert into test({0}) values ({1})'
    query = query.format(','.join(columns), ','.join('?' * len(columns)))
    for data in readCSV:
        cursor.execute(query, data)
con.commit()
This results in a '42601' error near ":" in the second column header.
The results are the same when I actually list the column headers and the ? placeholders out in the INSERT INTO statement.
What is the syntax to get the script to accept ":" on column headers? If there's no way, is there a way to scan through headers and remove the ":" at the end of each?
Because : is a special character, if your column is named year: in the DB, you must double quote its name --> select "year:" from test;
You are getting a PG error because you are referencing the unquoted column name (insert into test({0})), so add double quotes there.
query = 'insert into test("year:","day:", "etc:") values (...)'
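For example, generated from the header list rather than written by hand (a small sketch using the question's own columns variable):
# Double-quote each header so PostgreSQL treats the trailing ':' as part of the identifier
quoted = ','.join('"{}"'.format(c) for c in columns)
query = 'insert into test({0}) values ({1})'.format(quoted, ','.join('?' * len(columns)))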
That being said, it might be simpler to remove every occurrence of : in your csv's 1st line
Much appreciated, JGH and Adrian. I went with your suggestion to remove every occurrence of : by adding the following line after the first columns = ... statement:
columns = [column.strip(':') for column in columns]
It worked well.
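Putting the fix together, the relevant part of the loader might look like this (cursor and con are the existing connection objects from the question, which aren't shown here):
import csv

print("Reading file contents and copying into table...")
with open('C:\\Users\\admin\\Desktop\\test2.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    columns = next(readCSV)                              # header row
    columns = [column.strip(':') for column in columns]  # drop the trailing ':'
    query = 'insert into test({0}) values ({1})'
    # note: placeholder style depends on your DB driver (psycopg2 uses %s rather than ?)
    query = query.format(','.join(columns), ','.join('?' * len(columns)))
    for data in readCSV:
        cursor.execute(query, data)
con.commit()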

SalesForce Spark Delimiter issue

I have a Glue job in which I am reading a table from Salesforce (SF) using SOQL:
df = (
    spark.read.format("com.springml.spark.salesforce")
    .option("soql", sql)
    .option("queryAll", "true")
    .option("sfObject", sf_table)
    .option("bulk", bulk)
    .option("pkChunking", pkChunking)
    .option("version", "51.0")
    .option("timeout", "99999999")
    .option("username", login)
    .option("password", password)
    .load()
)
and whenever there is a combination of double-quotes and commas in the string it messes up my table schema, like so:
in source:

    Column A | Column B           | Column C
    000AB    | "text with, comma" | 123XX

read from SF in df:

    Column A | Column B    | Column C
    000AB    | "text with  | comma"
Is there any option to avoid such cases where the comma is treated as a delimiter? I tried various options but nothing worked. And SOQL doesn't accept REPLACE or SUBSTRING functions; its text-manipulation functions are, well, basically there aren't any.
All the information I'm giving needs to be tested; I do not have the same environment, so it is difficult for me to try anything, but here is what I found.
When you check the official doc, you find that there is a field metadataConfig. The documentation of this field can be found here: https://resources.docs.salesforce.com/sfdc/pdf/bi_dev_guide_ext_data_format.pdf
On page 2, under the CSV format, it says:
If a field value contains a control character or a new line the field value must be contained within double quotes (or your
fieldsEscapedBy value). The default control characters (fieldsDelimitedBy, fieldsEnclosedBy,
fieldsEscapedBy, or linesTerminatedBy) are comma and double quote. For example, "Director of
Operations, Western Region".
which kinda sounds like your current problem.
By default, the values are comma and double quote, so I do not understand why it is failing. But apparently your output keeps the double quotes, so maybe it is only considering single quotes.
You should try to enforce the format and add in your code:
.option("metadataConfig", '{"fieldsEnclosedBy": "\"", "fieldsDelimitedBy": ","}')
# Or something similar - I couldn't test, so you need to try it yourself

Pyspark: Reading csv files with fields having double quotes and commas

I have a csv file which I am reading through pyspark and loading into PostgreSQL. One of its fields has strings that contain commas and double quotes within the string, like the examples below:
1. "RACER ""K"", P.L. 9"
2. "JENIS, B. S. ""N"" JENIS, F. T. ""B"" 5"
Pyspark is parsing it as below, which is causing an issue because it mixes up the values/columns when I load the data into PostgreSQL and the script fails.
1. '\"RACER \"\"K\"\"'
2. '\"JENIS, B. S. \"\"N\"\" JENIS'
I am using Spark 2.4.2. How can this situation be handled in pyspark?
Basically, I want the program to ignore commas or double quotes when they appear inside a quoted field.
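One commonly used approach, separate from the answer below and worth verifying against your data, is to tell Spark's CSV reader that quotes inside a quoted field are escaped with another double quote (the RFC-4180 style used in the sample rows):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "data.csv" is a placeholder path; quote/escape tell Spark that embedded
# quotes are written as "" rather than \"
df = spark.read.csv(
    "data.csv",
    header=True,
    quote='"',
    escape='"',
    multiLine=True,  # only needed if quoted fields can span lines
)
df.show(truncate=False)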
You can try to remove the commas and double quotes using pandas before loading into PostgreSQL.
You can use str.replace:
df['column_name'] = df['column_name'].str.replace(r"[\"\',]", '', regex=True)