How to update a few records in Spark - Scala

I have the following program in Scala for Spark:
val dfA = sqlContext.sql("select * from employees where id in ('Emp1', 'Emp2')" )
val dfB = sqlContext.sql("select * from employees where id not in ('Emp1', 'Emp2')" )
val dfN = dfA.withColumn("department", lit("Finance"))
val dfFinal = dfN.unionAll(dfB)
dfFinal.registerTempTable("intermediate_result")
dfA.unpersist
dfB.unpersist
dfN.unpersist
dfFinal.unpersist
val dfTmp = sqlContext.sql("select * from intermediate_result")
dfTmp.write.mode("overwrite").format("parquet").saveAsTable("employees")
dfTmp.unpersist
When I try to save it, I get the following error:
org.apache.spark.sql.AnalysisException: Cannot overwrite table employees that is also being read from.;
at org.apache.spark.sql.execution.datasources.PreWriteCheck.failAnalysis(rules.scala:106)
at org.apache.spark.sql.execution.datasources.PreWriteCheck$$anonfun$apply$3.apply(rules.scala:182)
at org.apache.spark.sql.execution.datasources.PreWriteCheck$$anonfun$apply$3.apply(rules.scala:109)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:111)
at org.apache.spark.sql.execution.datasources.PreWriteCheck.apply(rules.scala:109)
at org.apache.spark.sql.execution.datasources.PreWriteCheck.apply(rules.scala:105)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:218)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:218)
at scala.collection.immutable.List.foreach(List.scala:318)
My questions are:
Is my approach correct to change the department of two employees?
Why am I getting this error when I have released the DataFrames?

Is my approach correct to change the department of two employees?
It is not. Just to repeat something that has been said multiple times on Stack Overflow - Apache Spark is not a database. It is not designed for fine-grained updates. If your project requires operations like this, use one of the many databases available on Hadoop.
Why am I getting this error when I have released the DataFrames?
Because you didn't. All you've done is add a name to the execution plan. Checkpointing would be the closest thing to "releasing", but you really don't want to end up in a situation where you lose an executor in the middle of a destructive operation.
You could write to a temporary directory, delete the input and move the temporary files, but really - just use a tool which is fit for the job.
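If you do go the temporary-directory route despite the caveats, here is a rough sketch using the Hadoop FileSystem API (the paths are placeholders, not taken from the question):
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical paths - adjust to where the employees table data actually lives.
val stagingPath = new Path("/tmp/employees_staging")
val tablePath   = new Path("/user/hive/warehouse/employees")

// 1. Materialize the result somewhere else first.
dfFinal.write.mode("overwrite").parquet(stagingPath.toString)

// 2. Only after the write succeeds, replace the original data.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(tablePath, true)        // remove the old table data
fs.rename(stagingPath, tablePath) // move the staged files into place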

Following is an approach you can try.
Instead of using the registerTempTable API, you can write it into another table using the saveAsTable API:
dfFinal.write.mode("overwrite").saveAsTable("intermediate_result")
Then, write it into the employees table:
val dy = sqlContext.table("intermediate_result")
dy.write.mode("overwrite").insertInto("employees")
Finally, drop the intermediate_result table.
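For completeness, the cleanup step can also be done through Spark SQL (a one-liner, assuming a Hive-backed metastore):
sqlContext.sql("DROP TABLE IF EXISTS intermediate_result")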

I would approach it this way:
>>> df = sqlContext.sql("select * from t")
>>> df.show()
+-------------+---------------+
|department_id|department_name|
+-------------+---------------+
| 2| Fitness|
| 3| Footwear|
| 4| Apparel|
| 5| Golf|
| 6| Outdoors|
| 7| Fan Shop|
+-------------+---------------+
To mimic your flow, I'm creating 2 dataframes, doing a union and writing back to the same table t (deliberately removing department_id = 4 in this example):
>>> df1 = sqlContext.sql("select * from t where department_id < 4")
>>> df2 = sqlContext.sql("select * from t where department_id > 4")
>>> df3 = df1.unionAll(df2)
>>> df3.registerTempTable("df3")
>>> sqlContext.sql("insert overwrite table t select * from df3")
DataFrame[]
>>> sqlContext.sql("select * from t").show()
+-------------+---------------+
|department_id|department_name|
+-------------+---------------+
| 2| Fitness|
| 3| Footwear|
| 5| Golf|
| 6| Outdoors|
| 7| Fan Shop|
+-------------+---------------+

Let's say it is a Hive table you are reading and overwriting.
Introduce a timestamp into the Hive table location as follows:
create table table_name (
  id int,
  dtDontQuery string,
  name string
)
location 'hdfs://user/table_name/timestamp'
As overwriting in place is not possible, we will write the output to a new location.
Write the data to that new location using the DataFrame API:
df.write.orc("hdfs://user/xx/tablename/newtimestamp/")
Once the data is written, alter the Hive table location to the new location:
ALTER TABLE tablename SET LOCATION 'hdfs://user/xx/tablename/newtimestamp/'
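Putting the two steps together, a rough sketch (the table name and base path are placeholders):
// Hypothetical table name and base path - adjust to your environment.
val newLocation = s"hdfs://user/xx/tablename/${System.currentTimeMillis()}"

// 1. Write the updated data to a fresh, timestamped directory.
df.write.orc(newLocation)

// 2. Point the Hive table at the new directory.
sqlContext.sql(s"ALTER TABLE tablename SET LOCATION '$newLocation'")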

Related

Spark scala selecting multiple columns from a list and single columns

I'm attempting to do a select on a dataframe but I'm having a little bit of trouble.
I have this initial dataframe
+---+-------+-------+-------+-------+
| id|value_a|value_b|value_c|value_d|
+---+-------+-------+-------+-------+
And what I have to do is sum value_a with value_b and keep the others the same. So I have this list:
val select_list = List("id", "value_c", "value_d")
and after this I do the select
df.select(select_list.map(col):_*, (col("value_a") + col("value_b")).as("value_b"))
And I'm expecting to get this:
+---+-------+-------+-------+
| id|value_c|value_d|value_b|
+---+-------+-------+-------+
where value_b is the sum of value_a and value_b (original)
But I'm getting a "no _* annotation allowed here" error. Keep in mind that in reality I have a lot of columns, so I need to use a list; I can't simply select each column. I'm running into this trouble because the new column that results from the sum has the same name as an existing column, so I can't just do select(col("*"), sum...).drop(value_b) or I'd be dropping both the old column and the new one with the sum.
What is the correct syntax to add multiple and single columns in a single select, or how else can I solve this?
For now I decided to do this:
df.select(col("*"), (col("value_a") + col("value_b")).as("value_b_tmp")).
  drop("value_a", "value_b").withColumnRenamed("value_b_tmp", "value_b")
This works fine, but I understand that withColumn and withColumnRenamed are expensive because I'm creating pretty much a new dataframe with a new or renamed column, and I'm looking for the least expensive operation possible.
Thanks in advance!
Simply use the .withColumn function; it will replace the column if it exists:
df
  .withColumn("value_b", col("value_a") + col("value_b"))
  .select(select_list.map(col): _*) // note: select_list must include "value_b" here, or the summed column is dropped again
You can create a new sum field and accumulate the result of summing the n columns as follows:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val df: DataFrame =
  spark.createDataFrame(
    spark.sparkContext.parallelize(Seq(Row(1, 2, 3), Row(1, 2, 3))),
    StructType(List(
      StructField("field1", IntegerType),
      StructField("field2", IntegerType),
      StructField("field3", IntegerType))))

val columnsToSum = df.schema.fieldNames

columnsToSum.filter(name => name != "field1")
  .foldLeft(df.withColumn("sum", lit(0)))((df, column) =>
    df.withColumn("sum", col("sum") + col(column)))
  .show()
Gives:
+------+------+------+---+
|field1|field2|field3|sum|
+------+------+------+---+
| 1| 2| 3| 5|
| 1| 2| 3| 5|
+------+------+------+---+
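An alternative sketch that avoids one withColumn call per column is to build the whole sum as a single Column expression and add it in one projection (same assumed schema as above):
import org.apache.spark.sql.functions.col

// Combine every field except "field1" into one sum expression.
val sumExpr = columnsToSum
  .filter(_ != "field1")
  .map(col)
  .reduce(_ + _)

df.withColumn("sum", sumExpr).show()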

How to compare the schema of two dataframes in a SQL statement?

There are many ways to verify the schema of two data frames in Spark, like here. But I want to verify the schema of the two data frames only in SQL, I mean Spark SQL.
Sample query 1:
SELECT DISTINCT target_person FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME IN ('columnA','ColumnB') AND TABLE_SCHEMA='ad_facebook'
Sample query 2:
SELECT count(*) FROM information_schema.columns WHERE table_name = 'ad_facebook'
I know that there is no concept of a database (schema) in Spark, but I read that the metastore contains schema information etc.
Can we write SQL queries like above in SparkSQL?
EDIT:
I am just checking why SHOW CREATE TABLE is not working in Spark SQL - is it because it's a temp table?
scala> val df1=spark.sql("SHOW SCHEMAS")
df1: org.apache.spark.sql.DataFrame = [databaseName: string]
scala> df1.show
+------------+
|databaseName|
+------------+
| default|
+------------+
scala> val df2=spark.sql("SHOW TABLES in default")
df2: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field]
scala> df2.show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | df| true|
+--------+---------+-----------+
scala> val df3=spark.sql("SHOW CREATE TABLE default.df")
org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'df' not found in database 'default';
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:180)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:398)
at org.apache.spark.sql.execution.command.ShowCreateTableCommand.run(tables.scala:834)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
... 48 elided
Schema can be queried using DESCRIBE [EXTENDED] [db_name.]table_name
See https://docs.databricks.com/spark/latest/spark-sql/index.html#spark-sql-language-manual
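For example, run from Scala (assuming an ad_facebook table exists in the current database):
spark.sql("DESCRIBE EXTENDED ad_facebook").show(truncate = false)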
Try this code, which extracts each schema and compares them. It compares column name, column datatype, and nullability.
val x = df1.schema.sortBy(x => x.name) // get dataframe 1 schema and sort it based on column name
val y = df2.schema.sortBy(x => x.name) // get dataframe 2 schema and sort it based on column name
val out = x.zip(y).filter(x => x._1 != x._2) // pair up the sorted fields of df1 and df2 and keep any pair that doesn't match (in name, datatype or nullability)
if (out.size == 0) { // `out` is empty if the schemas match
  println("matching")
} else {
  println("not matching")
}
We can get the schema in 2 ways in Spark SQL.
Method 1:
spark.sql("desc db_name.table_name").show()
This will display only the top 20 rows, which behaves exactly like df.show()
(meaning, for any table with more than 20 columns, the schema will be shown
only for the first 20 columns)
For example:
+--------------------+---------+-------+
| col_name|data_type|comment|
+--------------------+---------+-------+
| col1| bigint| null|
| col2| string| null|
| col3| string| null|
+--------------------+---------+-------+
Method 2:
spark.sql("desc db_name.table_name").collect().foreach(println)
This will display the complete schema of all the columns.
For example:
[col1,bigint,null]
[col2,string,null]
[col3,string,null]
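If the comparison has to stay as close to SQL as possible, one hedged option is to collect the DESC output of both tables and compare it on the driver; a rough sketch (database and table names are placeholders):
val schemaA = spark.sql("desc db_name.table_a").collect().toSet
val schemaB = spark.sql("desc db_name.table_b").collect().toSet

if (schemaA == schemaB) println("schemas match") else println("schemas differ")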

Spark SQL Dataframe API - build filter condition dynamically

I have two Spark dataframe's, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns which are specified in another file.
For example, the column lookup file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build the where condition dynamically for the above scenario, since the lookup file is configurable and might have 1 to n fields.
You can use the except dataframe method. I'm assuming that the columns to use are in two lists for simplicity. It's necessary that the order of both lists is correct; the columns at the same position in the lists will be compared (regardless of column name). After except, use join to get the missing columns from the first dataframe.
import org.apache.spark.sql.functions.broadcast
// in spark-shell .toDF is available out of the box; otherwise import spark.implicits._ first

val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
  .toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
  .toDF("eName", "eNo", "age", "city")

val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")

val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
  .except(df2.select(df2Cols.head, df2Cols.tail: _*))

val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
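Since the lookup file is configurable, the two lists themselves can be derived from it rather than hard-coded; a sketch assuming Spark 2.x's CSV reader and the df1col,df2col header shown in the question (the path is a placeholder):
// Read the lookup file and split it into the two column-name lists.
val mapping = spark.read
  .option("header", "true")
  .csv("/path/to/column_lookup.csv")
  .collect()

val df1Cols = mapping.map(_.getString(0).trim).toList
val df2Cols = mapping.map(_.getString(1).trim).toList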
If you're doing this from a SQL query I would remap the column names in the SQL query itself with something like Changing a SQL column title via query. You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that you can diff using something like
How to obtain the difference between two DataFrames?
If you need more columns that wouldn't be used in the diff (e.g. age) you can reselect the data again based on your diff results. This may not be the optimal way of doing it but it would probably work.

scala_spark_dataframe_hivecontext

I have a Hive table. It contains a column named "query" and there are 4 records in it. I read the Hive table using:
val query_hive = sqlContext.sql(s"select * from hive_query limit 1")
I need to use this query in another Hive query for the calculation.
I have tried this method:
val output = sqlContext.sql(s"$query_hive")
But I am getting an error. Can anybody suggest a solution?
You can do this. You are not passing the query correctly; see below:
scala> val query = "select * from src limit 1"
query: String = select * from src limit 1
scala> sql(s"""$query""").show
+---+-----+
|key|value|
+---+-----+
| 1| a|
+---+-----+
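If the goal is to execute the query text stored in the Hive table, the string has to be pulled out of the DataFrame first; interpolating the DataFrame itself only embeds its toString (the schema summary), not the query. A minimal sketch, assuming the query text is the first column:
// Extract the query string from the first row, then run it.
val queryString = query_hive.first().getString(0)
val output = sqlContext.sql(queryString)
output.show()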
Thanks.

Reference column by id in Spark Dataframe

I have multiple duplicate columns (due to joins). If I try to call them by alias, I get an ambiguous reference error:
Reference 'customers_id' is ambiguous, could be: customers_id#13, customers_id#85, customers_id#130
Is there a way to reference a column in a Scala Spark Dataframe by its order in the Dataframe or by numeric ID, not by an alias? The sanitized names suggest that columns do have an id assigned (13, 85, 130 in the error above).
LATER EDIT:
I found out that I can reference a specific column via the original dataframe it was in. But while I can use OriginalDataframe.customer_id in the select function, the withColumnRenamed function only accepts a string alias, so I cannot rename the duplicate column in the final dataframe.
So, I guess the end question is:
Is there a way to reference a column that has a duplicate alias in a way that works with all functions that require a string alias as an argument?
LATER EDIT 2:
Renaming seemed to work by adding a new column and dropping one of the current ones:
joined_dataframe = joined_dataframe.withColumn("renamed_customers_id", original_dataframe("customers_id")).drop(original_dataframe("customers_id"))
But, I'd like to keep my question open:
Is there a way to reference a column that has a duplicate alias (so, using something other than an alias) in a way that all functions which expect a string alias will accept?
One way to get out of such a situation is to create a new Dataframe using the old one's rdd, but with a new schema, where you can name each column as you'd like. This, of course, requires you to explicitly describe the entire schema, including the type of each column. As long as the new schema you provide matches the number of columns and the column types of the old Dataframe, this should work.
For example - starting with a Dataframe with two columns named type, we can rename them to type1 and type2:
df.show()
// +---+----+----+
// | id|type|type|
// +---+----+----+
// | 1| AAA| aaa|
// | 1| BBB| bbb|
// +---+----+----+
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val newDF = sqlContext.createDataFrame(df.rdd, new StructType()
  .add("id", IntegerType)
  .add("type1", StringType)
  .add("type2", StringType)
)
newDF.show()
// +---+-----+-----+
// | id|type1|type2|
// +---+-----+-----+
// | 1| AAA| aaa|
// | 1| BBB| bbb|
// +---+-----+-----+
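A lighter-weight alternative, if only the names need to change and the column order is known, is to rename every column positionally with toDF, which avoids restating the types (a sketch on the same example):
// Rename all columns by position; the number of names must match the column count.
val renamedDF = df.toDF("id", "type1", "type2")
renamedDF.show()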
The main problem is the join; I use Python.
h1.createOrReplaceTempView("h1")
h2.createOrReplaceTempView("h2")
h3.createOrReplaceTempView("h3")
joined1 = h1.join(h2, (h1.A == h2.A) & (h1.B == h2.B) & (h1.C == h2.C), 'inner')
Result dataframe columns:
A B Column1 Column2 A B Column3 ...
I don't like this, but the join must be implemented like this:
joined1 = h1.join(h2, [*argv], 'inner')
We assume argv = ["A", "B", "C"]
Result columns:
A B column1 column2 column3 ...
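For reference, the Scala side behaves the same way: passing the join keys as a Seq of column names keeps a single copy of A, B and C in the result (a sketch reusing the h1/h2 names from above):
// Joining on a sequence of column names avoids duplicate A, B, C columns in the output.
val joined1 = h1.join(h2, Seq("A", "B", "C"), "inner")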