scala spark dataframe hivecontext - scala

I have a Hive table. It contains one column called "query" and there are 4 records in it. I read the table like this:
val query_hive = sqlContext.sql(s"select * from hive_query limit 1")
I need to use this query in another Hive query for a calculation.
I have tried this method:
val output = sqlContext.sql(s"$query_hive")
But I am getting an error. Can anybody suggest a solution?

You can do this. You are not passing the query correctly; take a look below:
scala> val query = "select * from src limit 1"
query: String = select * from src limit 1
scala> sql(s"""$query""").show
+---+-----+
|key|value|
+---+-----+
|  1|    a|
+---+-----+
Thanks.
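Note that in the question query_hive is a DataFrame, not a String, so interpolating it with s"$query_hive" will not run the stored query. A minimal sketch of pulling the stored SQL text out first, assuming the column holding it is named query:
// select only the stored SQL text, bring the single row to the driver,
// then run that text as a new query
val query_hive = sqlContext.sql("select query from hive_query limit 1")
val queryText = query_hive.collect().head.getString(0)
val output = sqlContext.sql(queryText)
output.show()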

Related

Is there any way to subtract two dataframes with an alphanumeric datatype? I tried using except but the count of records is not coming out correctly

I am trying to subtract two data frames in Scala, and my data types are alphanumeric, e.g. I have a string as the data type for the id column. I tried the pandas-style approach
df1.merge(
df2, how='outer', indicator=True
).query('_merge == "left_only"').drop('_merge', 1)
as well as except in Scala:
val df1 = Seq(("1","2019-04-03 14:45:00","1"),("2","2019-04-03 14:45:00","1"),("3","2019-04-03 14:45:00","1")).toDF("ID","Timestamp","RowNum")
val df2 = Seq(("2","2019-04-03 13:45:00","2"),("3","2019-04-03 13:45:00","2")).toDF("ID","Timestamp","RowNum")
val idDiff = df1.select("ID").except(df2.select("ID"))
val outputDF = df1.join(idDiff, "ID")
But nothing helps. I was not getting the correct count. Any help will be appreciated.
So the outputDF should contain only one record, "1","2019-04-03 14:45:00","1"?
I've run your code and it looks like it works; you can get the same result with a left_anti join.
val idDiff = df1.select("ID").except(df2.select("ID"))
val outputDF = df1.join(idDiff, "ID")
outputDF.show()
df1.join(df2,Seq("ID"),"left_anti").show()
+---+-------------------+------+
| ID|          Timestamp|RowNum|
+---+-------------------+------+
|  1|2019-04-03 14:45:00|     1|
+---+-------------------+------+

+---+-------------------+------+
| ID|          Timestamp|RowNum|
+---+-------------------+------+
|  1|2019-04-03 14:45:00|     1|
+---+-------------------+------+

Pass filters as parameter to Dataframe.filter function

I have a Dataframe userdf as
val userdf = sparkSession.read.json(sparkContext.parallelize(Array("""[{"id" : 1,"name" : "user1"},{"id" : 2,"name" : "user2"}]""")))
scala> userdf.show
+---+-----+
| id| name|
+---+-----+
|  1|user1|
|  2|user2|
+---+-----+
I want to retrieve the user with id === 1, which I can achieve using code like
scala> userdf.filter($"id"===1).show
+---+-----+
| id| name|
+---+-----+
|  1|user1|
+---+-----+
What I want to achieve is like
val filter1 = $"id"===1
userdf.filter(filter1).show
These filters are fetched from configuration files, and I am trying to build a more complex query using this as a building block, something like
userdf.filter(filter1 OR filter2).filter(filter3).show
where filter1, filter2, filter3, and the AND/OR conditions are fetched from configuration
Thanks
The filter method can also accept a string that is a SQL expression.
This code should produce the same result:
userdf.filter("id = 1").show
So you can just get that string from config.
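If the configured conditions need to be combined, here is a minimal sketch, assuming the filters arrive from the config as SQL expression strings (the values below are made up for illustration):
import org.apache.spark.sql.functions.expr

// hypothetical strings read from a configuration file
val filter1 = expr("id = 1")
val filter2 = expr("name = 'user2'")
val filter3 = expr("id < 10")

// expr turns each string into a Column, and Columns compose with || (OR) and && (AND)
userdf.filter((filter1 || filter2) && filter3).show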

How to compare the schema of two dataframes in SQL statement?

There are many ways to verify the schema of two data frames in Spark, like here. But I want to verify the schema of the two data frames only in SQL, I mean Spark SQL.
Sample query 1:
SELECT DISTINCT target_person FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME IN ('columnA','ColumnB') AND TABLE_SCHEMA='ad_facebook'
Sample query 2:
SELECT count(*) FROM information_schema.columns WHERE table_name = 'ad_facebook'
I know that there is no concept of a database (schema) in Spark, but I read that the metastore contains schema information, etc.
Can we write SQL queries like the above in Spark SQL?
EDIT:
I am just checking why SHOW CREATE TABLE is not working in Spark SQL. Is it because it's a temp table?
scala> val df1=spark.sql("SHOW SCHEMAS")
df1: org.apache.spark.sql.DataFrame = [databaseName: string]
scala> df1.show
+------------+
|databaseName|
+------------+
|     default|
+------------+
scala> val df2=spark.sql("SHOW TABLES in default")
df2: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field]
scala> df2.show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|        |       df|       true|
+--------+---------+-----------+
scala> val df3=spark.sql("SHOW CREATE TABLE default.df")
org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'df' not found in database 'default';
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:180)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:398)
at org.apache.spark.sql.execution.command.ShowCreateTableCommand.run(tables.scala:834)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
... 48 elided
The schema can be queried using DESCRIBE [EXTENDED] [db_name.]table_name.
See https://docs.databricks.com/spark/latest/spark-sql/index.html#spark-sql-language-manual
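For example, a minimal sketch (db_name and table_name are placeholders):
spark.sql("DESCRIBE EXTENDED db_name.table_name").show(false) // truncate = false so long values are not cut off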
Try this code to extract each schema and compare them. It compares column name, data type, and nullability of each column.
val x = df1.schema.sortBy(x => x.name) // get dataframe 1 schema and sort it based on column name
val y = df2.schema.sortBy(x => x.name) // get dataframe 2 schema and sort it based on column name
val out = x.zip(y).filter(x => x._1 != x._2) // pair up the sorted StructFields of df1 and df2 and keep any that do not match
if (out.size == 0) { // size of `out` is 0 when every column name, type and nullability matches
  println("matching")
} else {
  println("not matching")
}
We can get the schema in 2 ways in Spark SQL.
Method 1:
spark.sql("desc db_name.table_name").show()
This will display only the top 20 rows, exactly like df.show() on a DataFrame
(meaning, for any table with more than 20 columns, the schema will be shown
only for the first 20 columns).
For Ex:
+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
|                col1|   bigint|   null|
|                col2|   string|   null|
|                col3|   string|   null|
+--------------------+---------+-------+
Method 2:
spark.sql("desc db_name table_name").collect().foreach(println)
This will display the complete schema of all the columns.
For Ex:
[col1,bigint,null]
[col2,string,null]
[col3,string,null]
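Building on this, one way to compare two tables' schemas without leaving Spark SQL is to diff the two DESCRIBE outputs. A sketch, with db_name.table1 and db_name.table2 as placeholder names:
val s1 = spark.sql("DESCRIBE db_name.table1").select("col_name", "data_type")
val s2 = spark.sql("DESCRIBE db_name.table2").select("col_name", "data_type")
// the schemas match when neither side has a (column, type) pair that the other lacks
val schemasMatch = s1.except(s2).count() == 0 && s2.except(s1).count() == 0
println(s"Schemas match: $schemasMatch")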

How to update a few records in Spark

I have the following program in Scala for Spark:
val dfA = sqlContext.sql("select * from employees where id in ('Emp1', 'Emp2')" )
val dfB = sqlContext.sql("select * from employees where id not in ('Emp1', 'Emp2')" )
val dfN = dfA.withColumn("department", lit("Finance"))
val dfFinal = dfN.unionAll(dfB)
dfFinal.registerTempTable("intermediate_result")
dfA.unpersist
dfB.unpersist
dfN.unpersist
dfFinal.unpersist
val dfTmp = sqlContext.sql("select * from intermediate_result")
dfTmp.write.mode("overwrite").format("parquet").saveAsTable("employees")
dfTmp.unpersist
When I try to save it, I get the following error:
org.apache.spark.sql.AnalysisException: Cannot overwrite table employees that is also being read from.;
at org.apache.spark.sql.execution.datasources.PreWriteCheck.failAnalysis(rules.scala:106)
at org.apache.spark.sql.execution.datasources.PreWriteCheck$$anonfun$apply$3.apply(rules.scala:182)
at org.apache.spark.sql.execution.datasources.PreWriteCheck$$anonfun$apply$3.apply(rules.scala:109)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:111)
at org.apache.spark.sql.execution.datasources.PreWriteCheck.apply(rules.scala:109)
at org.apache.spark.sql.execution.datasources.PreWriteCheck.apply(rules.scala:105)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:218)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:218)
at scala.collection.immutable.List.foreach(List.scala:318)
My questions are:
Is my approach correct to change the department of two employees?
Why am I getting this error when I have released the DataFrames?
Is my approach correct to change the department of two employees?
It is not. Just to repeat something that has been said multiple times on Stack Overflow: Apache Spark is not a database. It is not designed for fine-grained updates. If your project requires operations like this, use one of the many databases available on Hadoop.
Why am I getting this error when I have released the DataFrames?
Because you didn't. All you've done is add a name to the execution plan. Checkpointing would be the closest thing to "releasing", but you really don't want to end up in a situation where you lose an executor in the middle of a destructive operation.
You could write to a temporary directory, delete the input and move the temporary files, but really, just use a tool that is fit for the job.
Following is an approach you can try.
Instead of using the registerTempTable API, you can write it into another table using the saveAsTable API:
dfFinal.write.mode("overwrite").saveAsTable("intermediate_result")
Then, write it into the employees table:
val dy = sqlContext.table("intermediate_result")
dy.write.mode("overwrite").insertInto("employees")
Finally, drop the intermediate_result table.
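For example, the cleanup can be done with a single SQL statement:
sqlContext.sql("DROP TABLE IF EXISTS intermediate_result")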
I would approach it this way,
>>> df = sqlContext.sql("select * from t")
>>> df.show()
+-------------+---------------+
|department_id|department_name|
+-------------+---------------+
|            2|        Fitness|
|            3|       Footwear|
|            4|        Apparel|
|            5|           Golf|
|            6|       Outdoors|
|            7|       Fan Shop|
+-------------+---------------+
To mimic your flow, I create 2 dataframes, do a union, and write back to the same table t (deliberately removing department_id = 4 in this example).
>>> df1 = sqlContext.sql("select * from t where department_id < 4")
>>> df2 = sqlContext.sql("select * from t where department_id > 4")
>>> df3 = df1.unionAll(df2)
>>> df3.registerTempTable("df3")
>>> sqlContext.sql("insert overwrite table t select * from df3")
DataFrame[]
>>> sqlContext.sql("select * from t").show()
+-------------+---------------+
|department_id|department_name|
+-------------+---------------+
|            2|        Fitness|
|            3|       Footwear|
|            5|           Golf|
|            6|       Outdoors|
|            7|       Fan Shop|
+-------------+---------------+
Let's say it is a Hive table you are reading and overwriting.
Introduce a timestamp into the Hive table location as follows:
create table table_name (
  id int,
  dtDontQuery string,
  name string
)
location 'hdfs://user/table_name/timestamp'
As overwriting in place is not possible, we will write the output to a new location.
Write the data to that new location using the DataFrame API:
df.write.orc("hdfs://user/xx/tablename/newtimestamp/")
Once the data is written, alter the Hive table location to the new location:
alter table tablename set location 'hdfs://user/xx/tablename/newtimestamp/'
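That statement can also be issued from Spark itself rather than the Hive CLI; a minimal sketch, assuming the SQL context was created with Hive support:
sqlContext.sql("alter table tablename set location 'hdfs://user/xx/tablename/newtimestamp/'")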

Spark SQL DataFrame API - build filter condition dynamically

I have two Spark DataFrames, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns which are specified in another file.
For example, the column lookup file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build a where condition dynamically for the above scenario, because the lookup file is configurable, so it might have 1 to n fields.
You can use the except DataFrame method. I'm assuming that the columns to use are in two lists, for simplicity. It's necessary that the order of both lists is correct: the columns at the same position in the lists will be compared (regardless of column name). After except, use join to get the missing columns from the first dataframe.
val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
.toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
.toDF("eName", "eNo", "age", "city")
val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")
val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
.except(df2.select(df2Cols.head, df2Cols.tail: _*))
val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
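If the column pairs really do come from the lookup file shown in the question, the two lists can be built dynamically as well. A sketch, assuming the lookup file is a small CSV with header df1col,df2col and a made-up path:
// read the lookup file, collect the (df1col, df2col) pairs, and split them into the two lists
val lookup = spark.read.option("header", "true").csv("/path/to/column_lookup.csv")
val pairs = lookup.collect().map(r => (r.getString(0).trim, r.getString(1).trim))
val df1Cols = pairs.map(_._1).toList
val df2Cols = pairs.map(_._2).toList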
If you're doing this from a SQL query, I would remap the column names in the SQL query itself with something like Changing a SQL column title via query. You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that, you can diff using something like How to obtain the difference between two DataFrames?
If you need more columns that wouldn't be used in the diff (e.g. age), you can reselect the data again based on your diff results. This may not be the optimal way of doing it, but it would probably work.
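A sketch of the same "normalize the names, then diff" idea, done on the DataFrames with withColumnRenamed instead of editing the SQL text (the renames follow the lookup file above):
val df2Renamed = df2.withColumnRenamed("eName", "name").withColumnRenamed("eNo", "empNo")
val keyDiff = df1.select("name", "empNo").except(df2Renamed.select("name", "empNo"))
// reselect the remaining columns (e.g. age) from df1 based on the diff result
val result = df1.join(keyDiff, Seq("name", "empNo"))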