Schema capitalization (uppercase) problem when reading with Spark - Scala

Using Scala here:
val df = spark.read.format("jdbc").
  option("url", "<host url>").
  option("dbtable", "UPPERCASE_SCHEMA.table_name").
  option("user", "postgres").
  option("password", "<password>").
  option("numPartitions", 50).
  option("fetchsize", 20).
  load()
The database the above code reads from has many schemas, and they are all in uppercase letters (UPPERCASE_SCHEMA).
No matter how I try to indicate that the schema name is in all caps, Spark converts it to lowercase, and the query fails against the actual DB.
I've tried making it a variable, explicitly writing it in all uppercase, etc., in multiple languages, but no luck.
Would anyone know a workaround?
When I went into the actual DB (Postgres) and temporarily changed the schema to all lowercase, it worked absolutely fine.

Try setting spark.sql.caseSensitive to true (it is false by default):
spark.conf.set("spark.sql.caseSensitive", true)
You can see its definition in the source code:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L833
In addition, the JDBCWriteSuite shows how it affects the JDBC connector:
https://github.com/apache/spark/blob/ee95ec35b4f711fada4b62bc27281252850bb475/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala
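A minimal Scala sketch of that suggestion, assuming a SparkSession named spark and the same placeholder connection details as in the question:

spark.conf.set("spark.sql.caseSensitive", true)

// Re-run the read after changing the setting
val df = spark.read.format("jdbc").
  option("url", "<host url>").
  option("dbtable", "UPPERCASE_SCHEMA.table_name").
  option("user", "postgres").
  option("password", "<password>").
  load()

// If Spark still lowercases the identifier, another thing worth trying (not part of the
// answer above) is quoting it so Postgres receives it verbatim:
// option("dbtable", "\"UPPERCASE_SCHEMA\".\"table_name\"")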

Related

Scala Spark - Cannot resolve a column name

This should be pretty straightforward, but I'm having an issue with the following code:
val test = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .csv("sample.csv")
test.select("Type").show()
test.select("Provider Id").show()
test is a dataframe like so:
Type    Provider Id
A       asd
A       bsd
A       csd
B       rrr
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve '`Provider Id`' given input columns: [Type, Provider Id];;
'Project ['Provider Id]
It selects and shows the Type column just fine, but I couldn't get it to work for Provider Id. I wondered if it was because the column name has a space, so I tried using backticks and removing or replacing the space, but nothing seemed to work. Also, it runs fine with Spark 3.x but not with Spark 2.1.x (and I need to stay on 2.1.x).
Additionally, I tried changing the CSV column order from Type, Provider Id to Provider Id, Type. The error flipped: Provider Id shows, but now Type throws the exception.
Any suggestions?
test.printSchema()
You can use the output of printSchema() to see exactly how Spark read your columns in, then use those exact names in your code.
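A short sketch of that check, assuming the same two-column test dataframe; the names Spark reports (including any hidden characters) are the ones to pass to select():

test.printSchema()

// Print each column name wrapped in brackets so stray spaces or hidden characters are visible
test.columns.foreach(c => println(s"[$c]"))

// Then select using the exact name Spark reports (here, the second column)
test.select(test.columns.last).show()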

Change the format of file path which is partitioned by java.sql.Timestamp

We are using Spark as a data processing platform with the Scala programming language. When we write data to a storage account (ADLS Gen2), we partition it by a datetime column of type java.sql.Timestamp, using the Spark dataframe.write operation.
By default, it creates the following path on the storage account and writes Parquet files into it:
Path - __datetime=a/b/c/yyyy-MM-dd HH%3Amm%3Ass
The problem is that it has encoded the : characters but not the space, and because the path is only partially encoded it creates problems for us. Is there a fix for this?
Can I change the format of the column (of type java.sql.Timestamp) so that the output file path looks like one of the following, with no encoding at all?
__datetime=a/b/c/yyyy-MM-dd-HH-mm-ss
or
__datetime=a/b/c/yyyy_MM_dd_HH_mm_ss
Is it possible to do this within the java.sql.Timestamp object, without converting it to a string?
Thanks
You can change the name / type of the dataframe column with a simple select + alias.
The encoding is necessary, though, because file paths cannot have : characters, while they can have spaces... It's unclear why you would need full URL encoding.
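A minimal sketch of that approach, assuming the partition column is named datetime, the dataframe is df, and the output path is a placeholder; note that this does format the timestamp as a string for the partition column:

import org.apache.spark.sql.functions.{col, date_format}

// Format the timestamp into a path-safe string and partition by that column instead
val out = df
  .withColumn("__datetime", date_format(col("datetime"), "yyyy-MM-dd-HH-mm-ss"))
  .drop("datetime")

out.write.partitionBy("__datetime").parquet("<output path>")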

Special characters when creating a table via jdbc

I am trying to export a table from the cluster to Oracle using JDBC:
df.write.mode('append').option("createTableColumnTypes", "S_ID INT, BREAK_$ FLOAT").jdbc(jdbc, ORACLE_TABLENAME, properties = properties)
I am getting the following error:
As I understand it, this is because the $ symbol is a special character. How can I solve this problem if I cannot give up that column name? I tried 'BREAK_$' and {BREAK_$}, but it didn't work.

Filter spark DataFrame on string contains

I am using Spark 1.3.0 and Spark Avro 1.0.0.
I am working from the example on the repository page. This following code works well
val df = sqlContext.read.avro("src/test/resources/episodes.avro")
df.filter("doctor > 5").write.avro("/tmp/output")
But what if I need to check whether the doctor string contains a substring? Since we are writing our expression inside a string, how do I do a "contains"?
You can use contains (this works with an arbitrary sequence):
df.filter($"foo".contains("bar"))
like (SQL LIKE with the SQL simple pattern syntax, where _ matches an arbitrary character and % matches an arbitrary sequence):
df.filter($"foo".like("bar"))
or rlike (like with Java regular expressions):
df.filter($"foo".rlike("bar"))
depending on your requirements. LIKE and RLIKE should work with SQL expressions as well.
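Since the question writes its filter as a SQL string, here is a brief sketch of both forms, assuming the df from the question, the implicits for $ in scope, and a placeholder substring "who":

// As SQL string expressions, matching the style used in the question
df.filter("doctor LIKE '%who%'").show()
df.filter("doctor RLIKE 'who'").show()

// Equivalent with the Column API
df.filter($"doctor".contains("who")).show()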
In PySpark / Spark SQL syntax:
where column_n like 'xyz%'
might not work.
Use:
where column_n RLIKE '^xyz'
This works perfectly fine.
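A hedged Scala equivalent of that suggestion, assuming a Spark 2.x+ SparkSession named spark and the same column name column_n:

// Register the dataframe so it can be queried with SQL
df.createOrReplaceTempView("t")

// Anchored regex: rows where column_n starts with "xyz"
spark.sql("SELECT * FROM t WHERE column_n RLIKE '^xyz'").show()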

Use consistency level in Phantom-dsl and Cassandra

Currently using --
cqlsh> show version
[cqlsh 4.1.1 | Cassandra 2.0.17 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Using phantom-dsl 1.12.2 and Scala 2.10.
I can't figure out how to set consistency levels on queries.
There are predefined functions insert() and select() as part of CassandraTable. How can I pass the consistency level to them?
insert.value(....).consistencyLevel_=(ConsistencyLevel.QUORUM)
does not work and fails with an error (probably because this appends "USING CONSISTENCY QUORUM" at the end of the query). Here's the actual exception I get:
com.datastax.driver.core.exceptions.SyntaxError: line 1:424 no viable alternative at input 'CONSISTENCY'
at com.datastax.driver.core.Responses$Error.asException(Responses.java:122) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:120) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:186) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler.access$2300(RequestHandler.java:45) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:754) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:576) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
I see from the documentation and the discussion on this pull request that I could do a setConsistencyLevel(ConsistencyLevel.QUORUM) on a SimpleStatement, but I would prefer not to rewrite all the different insert statements.
UPDATE
Just to close the loop on this issue: I worked around it by creating a custom InsertQuery and using that instead of the one provided by final def insert in CassandraTable:
def qinsert()(implicit keySpace: KeySpace) = {
  val table = this.asInstanceOf[T]
  new InsertQuery[T, M, Unspecified](table,
    CQLQuery("INSERT into keyspace.tablename", consistencyLevel = ConsistencyLevel.QUORUM))
}
First of all, there is no setValue method inside phantom, and the API method you are using is missing an = at the end.
The correct structure is:
Table.insert
  .value(_.name, "test")
  .consistencyLevel_=(ConsistencyLevel.Quorum)
As you are on Stack Overflow, an error stack trace and specific details of what doesn't work are generally preferable to "does not work".
I have finally figured out how to properly set the consistency level using phantom-dsl.
Using a statement you can do the following:
statement.setConsistencyLevel(ConsistencyLevel.QUORUM)
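A brief sketch of that statement-based approach with the DataStax Java driver, assuming a connected Session named session and a hypothetical keyspace, table, and columns:

import com.datastax.driver.core.{ConsistencyLevel, SimpleStatement}

// Hypothetical insert; keyspace, table and column names are placeholders
val stmt = new SimpleStatement("INSERT INTO my_keyspace.my_table (id, name) VALUES (1, 'test')")

// Set the consistency level on the statement itself rather than in the CQL text
stmt.setConsistencyLevel(ConsistencyLevel.QUORUM)

session.execute(stmt)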
Also, take a look at the test project I've been working on to help people with examples using phantom-dsl:
https://github.com/iamthiago/cassandra-phantom