Spark Scala applying schema to RDD gives Scala match error - scala

I'm reading a file column by column, turning the values into a row, and I need to insert that into a table, so I'm applying a schema to convert it to a DataFrame. I have an integer value 2146411835 (within the Int range), but when I apply the schema with an integer type I get scala.MatchError: 2146411835 (of class java.lang.Integer). Any inputs?
Thanks,
Ash
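
Here is a minimal sketch of the pattern described above (the file path, delimiter, field names and parsing logic are all assumptions, not the asker's code); this kind of MatchError is typically raised by Spark's Catalyst converters when a Row value's runtime type does not match the type declared for that field in the schema.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Declare the target schema; the integer column uses IntegerType.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// Build an RDD[Row] whose value types must line up with the schema:
// 2146411835 fits in an Int, so cols(0).trim.toInt yields a java.lang.Integer at runtime.
val rowRdd = sc.textFile("/path/to/file")
  .map(_.split(","))
  .map(cols => Row(cols(0).trim.toInt, cols(1)))

val df = sqlContext.createDataFrame(rowRdd, schema)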

Related

How did spark RDD map to Cassandra table?

I am new to Spark, and recently I saw code that saves data from an RDD to a Cassandra table, but I am not able to figure out how it does the column mapping. It neither uses a case class nor specifies any column names, as in the code below:
rdd
.map(x => (x._1, x._2, x._3)) // x is a List here
.repartitionByCassandraReplica(keyspace, tableName)
.saveToCassandra(keyspace, tableName)
Since x here is simply a List[(Int, String, Int)], which is not a case class, there is no name mapping to the Cassandra table. So is there a definite column order in the Cassandra table that can match the order of the values we specify in the code?
This mapping relies on the order of columns in the Cassandra table definition, which is as follows:
Partition key columns in the specified order
Clustering columns in the specified order
The remaining columns, sorted alphabetically by name
The Spark Cassandra Connector expects the columns, in this order from the table definition, to match the order of the fields in the Scala tuple. You can see this in the source code of the TupleColumnMapper class.
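
As an illustration of that ordering rule (the keyspace, table and column names below are hypothetical), for a table defined as CREATE TABLE ks.users (bucket int, created int, name text, PRIMARY KEY (bucket, created)) the connector matches the tuple fields positionally: _1 -> bucket (partition key), _2 -> created (clustering column), _3 -> name (remaining column, alphabetical):

import com.datastax.spark.connector._

// (bucket, created, name) in exactly the table-definition order described above
val rows = sc.parallelize(Seq((1, 1696118400, "alice")))

rows
  .repartitionByCassandraReplica("ks", "users")
  .saveToCassandra("ks", "users")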

Unable to read jsonb columns in Postgres as StructType in Spark

I am trying to create a Spark DataFrame by reading a Postgres table. The Postgres table has some columns of type json and jsonb. Instead of parsing these columns as StructType, Spark converts them to StringType. How can this be fixed?
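
A minimal sketch of the behaviour being described (connection details, table and column names are assumptions): Spark's JDBC source maps Postgres json/jsonb to StringType, so the column arrives as a plain string; one possible workaround, if the JSON structure is known, is to parse it afterwards with from_json (Spark 2.1+).

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "events")
  .option("user", "user")
  .option("password", "secret")
  .load()

df.printSchema()  // the jsonb column (here "payload") shows up as string, not struct

// Hypothetical struct for the payload column, parsed back from the string.
val payloadSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)
))
val parsed = df.withColumn("payload", from_json(df("payload"), payloadSchema))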

Spark Dataframe Write to Postgres Column Name in Double Quotes

I am trying to write a DataFrame into Postgres, where the column names in the DataFrame are codes in UPPERCASE, but the table in Postgres has column names in lowercase. When I call dataframe.write.jdbc, it fails with:
java.sql.BatchUpdateException: Batch entry 0 INSERT INTO xxxxxxxx ("USE_CASE_ID","CUSTOMER_CODE","HOLDOUT","REFERENCE_ID","TAG_FIELDS","COMMS_RUN_ID","PRIMARY_OFFER_ID") VALUES
ERROR: column "USE_CASE_ID" of relation "xxxxxxxx" does not exist
How can I make this work for any database in the future? I am not sure why Spark SQL puts double quotes around the column names.
Another problem: for unit testing I use an H2 database, which expects the column names to be in UPPERCASE, so I'll have to satisfy multiple databases.
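
A minimal sketch of the situation (URL, data and connection properties are hypothetical): the INSERT in the error shows that Spark double-quotes the column names, which makes them case-sensitive in Postgres, so "USE_CASE_ID" no longer matches the lowercase column use_case_id; one possible approach, not necessarily portable to H2, is to normalise the DataFrame's column names before writing.

import java.util.Properties
import spark.implicits._

val df = Seq(("UC1", "C001")).toDF("USE_CASE_ID", "CUSTOMER_CODE")

// Rename every column to lowercase so the quoted identifiers match Postgres.
val lowered = df.toDF(df.columns.map(_.toLowerCase): _*)

val props = new Properties()
props.setProperty("user", "user")
props.setProperty("password", "secret")

lowered.write
  .mode("append")
  .jdbc("jdbc:postgresql://host:5432/db", "xxxxxxxx", props)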

pySpark jdbc write error: An error occurred while calling o43.jdbc. : scala.MatchError: null

I am trying to write a simple Spark DataFrame to a DB2 database using pySpark. The DataFrame has only one column, with double as its data type.
The DataFrame contains a single row, and its schema shows the one column as double.
When I try to write this dataframe to db2 table with this syntax:
dataframe.write.mode('overwrite').jdbc(url=url, table=source, properties=prop)
it creates the table in the database the first time without any issue, but if I run the code a second time, it throws the scala.MatchError: null exception shown in the title.
On the DB2 side the column datatype is also DOUBLE.
Not sure what I am missing.
I just changed a small part of the code and it worked perfectly.
Here is the small change I made to the syntax:
dataframe.write.jdbc(url=url, table=source, mode = 'overwrite', properties=prop)

Spark DataFrame duplicate column names while using mergeSchema

I have a huge Spark DataFrame which I create using the following statement
val df = sqlContext.read.option("mergeSchema", "true").parquet("parquet/partitions/path")
Now when I try to do a column rename or a select operation on the above DataFrame, it fails, saying ambiguous columns were found, with the following exception:
org.apache.spark.sql.AnalysisException: Reference 'Product_Type' is
ambiguous, could be Product_Type#13, Product_Type#235
Looking at the columns, I found there are two, Product_Type and Product_type, which appear to be the same column with a one-letter case difference, created by schema merges over time. I don't mind keeping the duplicate columns, but Spark's sqlContext for some reason doesn't like it.
I believe the spark.sql.caseSensitive config is true by default, so I don't know why it fails. I am using Spark 1.5.2 and I am new to Spark.
By default, the spark.sql.caseSensitive property is false, so before your rename or select statement you should set it to true:
sqlContext.sql("set spark.sql.caseSensitive=true")