How to update column values in a spark dataframe - scala

I'm trying to load data from Elasticsearch to Mongo DB using Spark.
I'm collecting the data from ES into a dataframe and then pushing the DF into Mongo DB.
Now, I have a column named '_id' in my dataframe which holds a String value,
i.e., _id: "abcd12345"
I would like to convert this column value to the Mongo ObjectId type before pushing to Mongo DB,
i.e., _id: ObjectId("abcd12345")
I have tried to achieve this with the following Spark code, but without luck:
import spark.implicits._

val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("query", esQuery)
  .option("pushdown", true)
  .option("scroll.size", Config.ES_SCROLL_SIZE)
  .load(Config.ES_RESOURCE)
  .withColumn("_id", $"_metadata".getItem("_id"))
  .drop("_metadata")

df("_id") = new ObjectId(df("_id").toString()) // error
I expect the resultant DataFrame to have the '_id' column value as
_id: ObjectId("abcd12345") instead of _id: "abcd12345"
Any help is appreciated; I'm really blocked on this issue.
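One approach worth trying (not from the original post): the MongoDB Spark connector conventionally represents a BSON ObjectId as a struct with a single string field named oid, so wrapping the _id string in such a struct is one way to have it land in Mongo as ObjectId. A minimal sketch under that assumption (worth verifying against the connector version in use):

import org.apache.spark.sql.functions.struct

// Assumption: the MongoDB Spark connector writes a struct column with a single
// string field named "oid" as a BSON ObjectId, so this should arrive in Mongo
// as _id: ObjectId("abcd12345") rather than a plain string.
val dfForMongo = df.withColumn("_id", struct($"_id".as("oid")))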

Related

Spark DataFrame orderBy and DataFrameWriter sortBy, is there a difference?

Is there a difference in the output between sorting before or after the .write command on a DataFrame?
val people: Dataset[Person] = ???

people
  .orderBy("name")
  .write
  .mode(SaveMode.Append)
  .format("parquet")
  .saveAsTable("test_segments")
and
val people: Dataset[Person] = ???

people
  .write
  .sortBy("name")
  .mode(SaveMode.Append)
  .format("parquet")
  .saveAsTable("test_segments")
The difference between the two is explained in the comments within the Spark source code:
orderBy: a Dataset/DataFrame operation. Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
sortBy: a DataFrameWriter operation. Sorts the output in each bucket by the given columns.
The sortBy method will only work when you are also defining buckets (bucketBy). Otherwise you will get an exception:
if (sortColumnNames.isDefined && numBuckets.isEmpty) {
  throw new AnalysisException("sortBy must be used together with bucketBy")
}
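For illustration, a minimal sketch of a write that satisfies that check, bucketing on an assumed id column (not part of the original question):

people.write
  .bucketBy(4, "id")   // "id" is a hypothetical bucketing column
  .sortBy("name")      // sortBy is only valid together with bucketBy
  .mode(SaveMode.Append)
  .format("parquet")
  .saveAsTable("test_segments")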
The columns defined in sortBy are used in the BucketSpec as sortColumnNames, as shown below:
Params:
numBuckets – the number of buckets.
bucketColumnNames – the names of the columns used to generate the bucket id.
sortColumnNames – the names of the columns used to sort data in each bucket.
case class BucketSpec(
  numBuckets: Int,
  bucketColumnNames: Seq[String],
  sortColumnNames: Seq[String])

Spark dataframe cast column for Kudu compatibility

(I am new to Spark, Impala and Kudu.) I am trying to copy a table from an Oracle DB to an Impala table having the same structure, in Spark, through Kudu. I am getting an error when the code tries to map an Oracle NUMBER to a Kudu data type. How can I change the data type of a Spark DataFrame to make it compatible with Kudu?
This is intended to be a 1-to-1 copy of data from Oracle to Impala. I have extracted the Oracle schema of the source table and created a target Impala table with the same structure (same column names and a reasonable mapping of data types). I was hoping that Spark+Kudu would map everything automatically and just copy the data. Instead, Kudu complains that it cannot map DecimalType(38,0).
I would like to specify that "column #1, with name SOME_COL, which is a NUMBER in Oracle, should be mapped to a LongType, which is supported in Kudu".
How can I do that?
// This works
val df: DataFrame = spark.read
  .option("fetchsize", 10000)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .jdbc("jdbc:oracle:thin:@(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)

// This does not work
kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
// Error: No support for Spark SQL type DecimalType(38,0)
// See https://github.com/cloudera/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/SparkUtil.scala

// So let's look at the Spark data types
df.dtypes.foreach { case (colName, colType) => println(s"$colName: $colType") }
// Spark data type:  SOME_COL DecimalType(38,0)
// Oracle data type: SOME_COL NUMBER -- no precision specifier; values are int/long
// Kudu data type:   SOME_COL BIGINT
Apparently, we can specify a custom schema when reading from a JDBC data source.
connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
That worked. I was able to specify a customSchema like so:
col1 Long, col2 Timestamp, col3 Double, col4 String
and with that, the code works:
import spark.implicits._

val df: Dataset[case_class_for_table] = spark.read
  .option("fetchsize", 10000)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .jdbc("jdbc:oracle:thin:@(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)
  .as[case_class_for_table]

kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
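The snippet above does not show where the customSchema string is attached; presumably it goes into the JDBC connection properties, as in the documentation example earlier. A minimal sketch of that assumption, where props is the java.util.Properties instance already passed to .jdbc(...):

// Assumption: supplying customSchema through the connection properties makes
// Spark map the Oracle NUMBER columns to the listed Spark SQL types.
props.put("customSchema", "col1 Long, col2 Timestamp, col3 Double, col4 String")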

How to compare the schema of a dataframe read from an RDBMS table against the same table on Hive?

I created a dataframe by reading an RDBMS table from postgres as below:
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("numPartitions", 10)
  .load()
execQuery content: select qtd_balance_text,ytd_balance_text,del_flag,source_system_name,period_year from dbname.hrtable;
This is the schema of my final dataframe:
println(yearDF.schema)
StructType(StructField(qtd_balance_text,StringType,true), StructField(ytd_balance_text,StringType,true), StructField(del_flag,IntegerType,true), StructField(source_system_name,StringType,true), StructField(period_year,DecimalType(15,0),true))
There is a table on Hive with the same name (hrtable) and the same column names. Before ingesting the data into the Hive table, I want to add a check in the code to verify that the schemas of the GP and Hive tables are the same.
I was able to access the Hive schema as follows:
spark.sql("desc formatted databasename.hrtable").collect.foreach(println)
But the problem is that it returns the schema in a different format:
[ qtd_balance_text,bigint,null]
[ ytd_balance_text,string,null]
[ del_flag,string,null]
[source_system_name,bigint,null]
[ period_year,bigint,null]
[Type,MANAGED,]
[Provider,hive,]
[Table Properties,[orc.stripe.size=536870912, transient_lastDdlTime=1523914516, last_modified_time=1523914516, last_modified_by=username, orc.compress.size=268435456, orc.compress=ZLIB, serialization.null.format=null],]
[Location,hdfs://devenv/apps/hive/warehouse/databasename.db/hrtable,]
[Serde Library,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
[InputFormat,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Storage Properties,[serialization.format=1],]
[Partition Provider,Catalog,]
Clearly I cannot compare schemas presented in this way, and I couldn't figure out how to do it.
Could anyone let me know how to properly compare the schema of the dataframe yearDF against the Hive table hrtable?
Instead of parsing the Hive table's schema output, you can try this option:
Read the Hive table as a dataframe as well. Assume that dataframe is df1 and your yearDF is df2, then compare the schemas as below.
If the column count can also differ between the two dataframes, additionally compare the schema sizes (df1.schema.size == df2.schema.size) before zipping.
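A minimal sketch of that first step, assuming the Hive table name from the question:

// Read the Hive table as a dataframe; the table name is taken from the question.
val df1 = spark.table("databasename.hrtable")
val df2 = yearDF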
// Get each dataframe's schema and sort its fields by column name.
val x = df1.schema.sortBy(_.name)
val y = df2.schema.sortBy(_.name)

// Zip the two sorted schemas field by field (column name plus data type)
// and keep only the pairs that do not match.
val out = x.zip(y).filter { case (a, b) => a != b }

if (out.isEmpty) println("matching") // out should be empty if the schemas match
else println("not matching")
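A hedged variant that also guards against differing column counts, as mentioned above:

// Sketch: compare field counts first, then the sorted fields themselves.
val sameSchema =
  df1.schema.size == df2.schema.size &&
  df1.schema.sortBy(_.name) == df2.schema.sortBy(_.name)
if (sameSchema) println("matching") else println("not matching")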

How to use createDataFrame to create a pyspark dataframe?

I know this is probably a stupid question. I have the following code:
from pyspark.sql import SparkSession
rows = [1,2,3]
df = SparkSession.createDataFrame(rows)
df.printSchema()
df.show()
But I got an error:
createDataFrame() missing 1 required positional argument: 'data'
I don't understand why this happens because I already supplied 'data', which is the variable rows.
Thanks
You have to create a SparkSession instance using the builder pattern and use it to create the dataframe; see
https://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession
spark = SparkSession.builder.getOrCreate()
Below are the steps to create a pyspark dataframe using createDataFrame.
Create a SparkSession:
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
Create the data and column names:
columns = ["language", "users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
First approach: create the DataFrame from an RDD:
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd).toDF(*columns)
Second approach: create the DataFrame directly from the data:
df2 = spark.createDataFrame(data).toDF(*columns)
Try
rows = [(1,), (2,), (3,)]
?
If I am not wrong, createDataFrame() takes the data and, optionally, the column names (schema) as input. The data should be a list of tuples, where each tuple is a row of the dataframe.

How do I convert Array[Row] to RDD[Row]

I have a scenario where I want to convert the result of a dataframe which is in the format Array[Row] to RDD[Row]. I have tried using parallelize, but I don't want to use it because it needs to hold the entire data on a single system, which is not feasible on the production box.
val Bid = spark.sql("select Distinct DeviceId, ButtonName from stb").collect()
val bidrdd = sparkContext.parallelize(Bid)
How do I achieve this? I tried the approach given in this link (How to convert DataFrame to RDD in Scala?), but it didn't work for me.
val bidrdd1 = Bid.map(x => (x(0).toString, x(1).toString)).rdd
It gives the error: value rdd is not a member of Array[(String, String)]
The variable Bid which you've created here is not a DataFrame; it is an Array[Row], which is why you can't call .rdd on it. If you want an RDD[Row], simply call .rdd on the DataFrame itself (without calling collect):
val rdd = spark.sql("select Distinct DeviceId, ButtonName from stb").rdd
Your post contains some misconceptions worth noting:
... a dataframe which is in the format Array[Row] ...
Not quite - the Array[Row] is the result of collecting the data from the DataFrame into Driver memory - it's not a DataFrame.
... I don't want to use it because it needs to hold the entire data on a single system ...
Note that as soon as you call collect on the DataFrame, you've already collected the entire data into a single JVM's memory, so using parallelize is not the issue.
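If the (DeviceId, ButtonName) tuples from the attempted code are what is actually needed, the same mapping can be done without collecting; a minimal sketch, assuming both columns are strings:

// Map each Row to a (String, String) tuple while the data stays distributed.
val bidRdd = spark.sql("select distinct DeviceId, ButtonName from stb")
  .rdd
  .map(row => (row.getString(0), row.getString(1)))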