I ran the following query in Hive and it successfully updated the column value in the table: select id, regexp_replace(full_name,'A','C') from table
But when I ran the same query from Spark SQL, it did not update the actual records
hiveContext.sql("select id, regexp_replace(full_name,'A','C') from table")
But when I do a hiveContext.sql("select id, regexp_replace(full_name,'A','C') from table").show(), it displays A replaced with C successfully — only in the display, though, and not in the actual table.
I tried to assign the result to another variable
val vFullName = hiveContext.sql("select id, regexp_replace(full_name,'A','C') from table")
and then
vFullName.show() -- it displays the original values without replacement
How do I get the value replaced in the table from SparkSQL?
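A SELECT only computes a result set; it never modifies the stored rows, in Hive or in Spark. To persist the replacement you have to write the result back. A minimal sketch (the table name mytable and the staging table name are illustrative, since the original uses table, a reserved word):

```scala
import org.apache.spark.sql.functions.{col, regexp_replace}

// Compute the replacement as a new DataFrame column.
val updated = hiveContext.table("mytable")
  .withColumn("full_name", regexp_replace(col("full_name"), "A", "C"))

// Persist it explicitly. Writing to a staging table and then swapping is
// safer than overwriting the same table that is being read in one job.
updated.write.mode("overwrite").saveAsTable("mytable_updated")
```

In Hive itself, only an UPDATE statement on a transactional table would change the stored values; the SELECT you ran displayed replaced values without touching the table.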
I have a scenario where I need to read a column from a DataFrame, filtered by a condition on another column of the same DataFrame, and then pass those values as an IN condition to select matching rows from a second DataFrame. How can I achieve this with Spark DataFrames?
In SQL it will be like:
select distinct(A.date) from table A where A.key in (select B.key from table B where cond='D');
I tried like below:
val Bkey: DataFrame = b_df.filter(col("cond")==="D").select(col("key"))
I have table A's data in the a_df DataFrame and table B's data in b_df. How can I pass the Bkey values into the outer query in Spark?
You can do a left semi join, which keeps only the rows of a_df whose key appears in the filtered b_df:
val result = a_df.join(b_df.filter(col("cond")==="D"), Seq("key"), "left_semi").select("date").distinct()
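A self-contained sketch of the same pattern (the data and column values are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("semi-join-demo").getOrCreate()
import spark.implicits._

val a_df = Seq(("2019-01-01", "k1"), ("2019-01-02", "k2"), ("2019-01-01", "k1"))
  .toDF("date", "key")
val b_df = Seq(("k1", "D"), ("k2", "E")).toDF("key", "cond")

// Equivalent of:
//   select distinct(A.date) from A where A.key in (select B.key from B where cond = 'D')
val result = a_df
  .join(b_df.filter(col("cond") === "D"), Seq("key"), "left_semi")
  .select("date")
  .distinct()

result.show()  // only dates whose key appears in b_df with cond = 'D'
```

A left semi join behaves like an IN/EXISTS subquery: it filters a_df without ever pulling b_df's columns into the result, so there is no duplication if a key matches several rows in b_df.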
I am trying to insert record into a table using a variable but it is failing.
command:
val query = "INSERT into TABLE Feed_metadata_s2 values ('LOGS','RUN_DATE',{} )".format(s"$RUN_DATE")
spark.sql(s"query")
spark.sql("INSERT into TABLE Feed_metadata_s2 values ('LOGS','ExtractStartTimestamp',$ExtractStartTimestamp)")
error:
INSERT into TABLE Feed_metadata_s2 values ('SDEDLOGS','ExtractStartTimestamp',$ExtractStartTimestamp)
------------------------------------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
It seems you're confused about string interpolation: you need to prefix the string with s so that the variable is substituted into it. As written, spark.sql(s"query") submits the literal text query rather than the contents of the variable, and "…{}…".format(...) is not how Scala string formatting works. The first two lines can be simplified:
val query = s"INSERT into TABLE Feed_metadata_s2 values ('LOGS','RUN_DATE',$RUN_DATE)"
spark.sql(query)
spark.sql(s"INSERT into TABLE Feed_metadata_s2 values ('LOGS','ExtractStartTimestamp',$ExtractStartTimestamp)")
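One more caveat, assuming ExtractStartTimestamp holds a string or timestamp value rather than a number: the interpolated value must also be wrapped in single quotes inside the SQL literal, or the parser fails on the raw text — which matches the ParseException above:

```scala
// Quote the interpolated value so the SQL parser sees a string literal.
spark.sql(
  s"INSERT into TABLE Feed_metadata_s2 values ('LOGS','ExtractStartTimestamp','$ExtractStartTimestamp')"
)
```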
I have a few rows stored in a source table (as defined as $schema.$sourceTable in the UPDATE query below). This table has 3 columns: TABLE_NAME, PERMISSION_TAG_COL, PT_DEPLOYED
I have an update statement stored in a string like:
var update_PT_Deploy = s"UPDATE $schema.$sourceTable SET PT_DEPLOYED = 'Y' WHERE TABLE_NAME = '$tableName';"
My source table does have rows with TABLE_NAME as $tableName (parameter) as I inserted rows into this table using another function of my program. The default value of PT_DEPLOYED when I inserted the rows was specified as NULL.
I'm trying to execute update using JDBC in the following manner:
println(update_PT_Deploy)
val preparedStatement: PreparedStatement = connection.prepareStatement(update_PT_Deploy)
val row = preparedStatement.execute()
println(row)
println("row updated in table successfully")
preparedStatement.close()
The above piece of code does not throw any exception, but when I query my table in a tool like DBeaver, the NULL value of PT_DEPLOYED is not updated to 'Y'.
If I execute the same query from update_PT_Deploy directly inside DBeaver, it works and the table updates. I am sure I am following the correct steps.
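Two common causes are worth ruling out here, sketched under the assumption that connection is a plain java.sql.Connection: the trailing semicolon inside the statement string (some JDBC drivers, Oracle's among them, reject or mishandle it), and autocommit being disabled, which leaves the update invisible to other sessions until an explicit commit:

```scala
import java.sql.PreparedStatement

// Drop the trailing semicolon from the statement string.
val update_PT_Deploy =
  s"UPDATE $schema.$sourceTable SET PT_DEPLOYED = 'Y' WHERE TABLE_NAME = '$tableName'"

val preparedStatement: PreparedStatement = connection.prepareStatement(update_PT_Deploy)
// executeUpdate returns the affected-row count, which also tells you
// whether the WHERE clause matched anything at all.
val rowsUpdated = preparedStatement.executeUpdate()
println(s"$rowsUpdated row(s) updated")
preparedStatement.close()

// With autocommit off, tools like DBeaver won't see the change
// until the transaction is committed.
if (!connection.getAutoCommit) connection.commit()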
I have a dataframe which needs to have a unique load timestamp column. no two records in the dataframe should have same value in this field.
I tried using built-in functions such as current_timestamp(), but they don't work: they are evaluated once per query, so every row gets the same value. I even tried creating a UDF to generate the timestamp, as below:
val generateUniqueTimestamp = udf(() => new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS").format(new java.util.Date()).toString)
var df = dataFrame.withColumn("LOAD_TS", generateUniqueTimestamp())
Say the data frame has three records; each should get an extra timestamp field that differs from the others.
Actual result
rec1 ,2019-09-05 22:00:00:000
rec2 ,2019-09-05 22:00:00:000
rec3 ,2019-09-05 22:00:00:000
Expected result
rec1 ,2019-09-05 22:00:00:000
rec2 ,2019-09-05 22:00:00:001
rec3 ,2019-09-05 22:00:00:002
The steps below solved the issue for now, though it might not be the cleanest way to do it.
Created a method and registered it with Spark as a UDF:
spark.udf.register("uniquetimestamp", uniquetimestap(_: String))
Created an empty column using
.withColumn("LOAD_TS", lit(null: String))
Extracted all columns of the dataframe and iterated through them to generate a SQL expression; every column passes through as-is except LOAD_TS:
// cast(uniquetimestamp(load_ts) as timestamp) as load_ts
val df = dataframe.selectExpr(sqlCastingExpr.split(","): _*)
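An alternative sketch that avoids a UDF entirely: keep one base timestamp for the whole load and add a per-row millisecond offset derived from row_number(), so uniqueness holds by construction. (The dataFrame variable is the one from the question; unix_timestamp works in seconds, hence the division by 1000.)

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// row_number() over monotonically_increasing_id() yields 1, 2, 3, ...
// Note: an un-partitioned window moves all rows to a single partition,
// which is acceptable for modest load volumes.
val w = Window.orderBy(monotonically_increasing_id())

val withLoadTs = dataFrame
  .withColumn("rn", row_number().over(w))
  .withColumn(
    "LOAD_TS",
    // base seconds + n milliseconds, cast back to a timestamp
    (unix_timestamp(current_timestamp()).cast("double") + col("rn") / 1000.0)
      .cast("timestamp")
  )
  .drop("rn")
```

Because current_timestamp() is fixed for the duration of the query, the offsets 0.001, 0.002, 0.003 s produce exactly the rec1/rec2/rec3 pattern in the expected result.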
While writing data into hive partitioned table, I am getting below error.
org.apache.spark.SparkException: Requested partitioning does not match the tablename table:
I converted my RDD to a DataFrame using a case class and am trying to write the data into an existing Hive partitioned table, but I get this error; in the printed logs, "Requested partitions:" comes up blank, while the partition columns show up as expected on the Hive table.
spark-shell error :-
scala> data1.write.format("hive").partitionBy("category", "state").mode("append").saveAsTable("sampleb.sparkhive6")
org.apache.spark.SparkException: Requested partitioning does not match the sparkhive6 table:
Requested partitions:
Table partitions: category,state
Hive table format :-
hive> describe formatted sparkhive6;
OK
col_name data_type comment
txnno int
txndate string
custno int
amount double
product string
city string
spendby string
Partition Information
col_name data_type comment
category string
state string
Try the insertInto() function instead of saveAsTable(). Note that insertInto() can't be combined with partitionBy(): it uses the partitioning already defined on the table, so the partitionBy call must be dropped and the DataFrame's column order must match the table schema.
scala> data1.write.format("hive")
    .mode("append")
    .insertInto("sampleb.sparkhive6")
(or)
Register a temp view on top of the dataframe, then insert the data into the Hive table with a SQL statement:
scala> data1.createOrReplaceTempView("temp_vw")
scala> spark.sql("insert into sampleb.sparkhive6 partition(category,state) select txnno,txndate,custno,amount,product,city,spendby,category,state from temp_vw")
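The partition(category,state) clause without literal values is a dynamic-partition insert, which Hive rejects in strict mode; if it fails with a dynamic-partition error, these settings usually need to be enabled first (a sketch, run in the same session before the insert):

```scala
// Allow dynamic partitioning, and allow it without any static partition column.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql(
  """insert into sampleb.sparkhive6 partition(category, state)
    |select txnno, txndate, custno, amount, product, city, spendby, category, state
    |from temp_vw""".stripMargin
)
```

The partition columns (category, state) must come last in the SELECT list, in the same order as the table's partition definition.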