Dynamically framing column names to pass to select while creating a Spark DataFrame - Scala

I've been working with Spark and Scala for the past two months and I'm new to this technology. I framed the select columns (with regexp_replace) as a List[String]() and passed it to the select during Spark DataFrame creation, and it's throwing a "Cannot resolve" error. Please find below the steps I have followed and tried.
Defining the vals:
The column I would like to identify in the src DataFrame:
val col_name = "region_id"
The column from the ref DataFrame that will be used to replace the src DataFrame column:
val surr_key_col_name = "surrogate_key"
I have created two DataFrames as shown below:
src_df
region_id | region_name | region_code
10001189  | Spain       | SP09 8545
10001765  | Africa      | AF97 6754
ref_df
region_id | surrogate_key
1189      | 2345
1765      | 8978
val src_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("s3://bucket/src_details.csv")
val ref_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("s3://bucket/ref_details.csv")
I'm iterating through the source columns to identify the column that needs the regex match-and-replace with the other DataFrame's column value, and I'm storing the framed expressions in a List to pass to the DataFrame's select:
val src_header_rec = src_df.columns.toList
// Loop through source file header to identify the region_id and replace it with surrogate_id by doing a pattern match( I don't want to replace the
var src_column_names = List[String]()   // the List[String]() mentioned above
for (src_header_cols <- src_header_rec) {
  if (col_name == src_header_cols) {
    src_column_names :+= "regexp_replace(" + "$" + s""""src.$src_header_cols"""" + "," + "$" + s""""ref.$src_header_cols"""" + "," + "$" + s""""ref.$surr_key_col_name"""" + ")" + ".as(" + s""""$src_header_cols"""" + ")"
  }
  else {
    src_column_names :+= "src." + src_header_cols
  }
}
After building the select columns in the List[String]() using the for loop above, I pass them to the select for the final_df creation:
val final_df = src_df.alias("src").join(ref_df.alias("ref"), src_df(col_name)=== ref_df(col_name),"left_outer").select(src_column_names.head,src_column_names.tail:_*)
If I directly pass the columns to the DataFrame's select without using the List[String](), my regexp_replace substitution works:
val final_df = src_df.alias("src").join(ref_df.alias("ref"), src_df(col_name)=== ref_df(col_name),"left_outer").select(regexp_replace($"src.region_id",$"ref.region_id",$"ref.surrogate_key").as("region_id"))
I'm not sure why it's not working when I pass it as a List[String]().
When I remove the regexp_replace substitution in the for loop and pass the columns as a List[String]() to the DataFrame select, it works properly, as shown below.
This code works very well with the DataFrame select:
for (src_header_cols <- src_header_rec) {
  if (col_name == src_header_cols) {
    src_column_names :+= "ref." + surr_key_col_name
  }
  else {
    src_column_names :+= "src." + src_header_cols
  }
}
val final_df = src_df.alias("src").join(ref_df.alias("ref"), src_df(col_name)===ref_df(col_name),"left_outer").select(src_column_names.head,src_column_names.tail:_*)
The result/output DataFrame I'm trying to derive is:
final_df
region_id    | region_name | region_code
1000**2345** | Spain       | SP09 8545
1000**8978** | Africa      | AF97 6754
So, when I build the select expressions with regexp_replace as a List in the for loop and use them in the DataFrame select, it throws me a "Cannot resolve" error.

I have tried creating temporary views of the DataFrames and using the same regexp in the select statement against the temporary views. It worked. Please find below the code I tried:
// This for loop scans through my header list, and for whichever column matches it frames the regexp. So, the region_id from the DataFrame header matches the variable value that I have defined.
// ref_col_name is assumed to hold the matching column name in ref_df (defined the same way as col_name above)
for (src_header_cols <- src_header_rec) {
  if (col_name == src_header_cols) {
    src_column_names :+= "regexp_replace(src." + s"$src_header_cols" + ",ref." + s"$ref_col_name" + ",ref." + s"$surr_key_col_name" + ")" + s" $src_header_cols"
  }
  else {
    src_column_names :+= "src." + src_header_cols
  }
}
//Creating Temporary view to apply SQL queries on it
src_df.createOrReplaceTempView("src")
ref_df.createOrReplaceTempView("ref")
//Framing SQL statements to be passed while querying
val selectExpr_1 = "select "+src_column_names.mkString(",")
val selectExpr_2 = selectExpr_1+" from src left outer join ref on(src."+s"$col_name"+" = ref."+s"$ref_col_name"+")"
// Creating a final Data Frame using the SQL statement created
val src_policy_masked_df = spark.sql(s"$selectExpr_2")
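Why the List[String] version fails: select(col: String, cols: String*) treats each string as a plain column name, so the whole "regexp_replace(...)" text is looked up as a single column and cannot be resolved, whereas selectExpr and expr() run the strings through the SQL expression parser. Below is a minimal sketch of keeping the loop but wrapping each framed string with org.apache.spark.sql.functions.expr; the select_exprs name is illustrative, everything else reuses the vals defined above.
import org.apache.spark.sql.functions.expr

// Frame each entry as plain SQL text (no $"..." column syntax), as in the temp-view version.
var select_exprs = List[String]()
for (src_header_cols <- src_header_rec) {
  if (col_name == src_header_cols) {
    select_exprs :+= s"regexp_replace(src.$src_header_cols, ref.$src_header_cols, ref.$surr_key_col_name) AS $src_header_cols"
  } else {
    select_exprs :+= s"src.$src_header_cols"
  }
}

// expr() parses each string into a real Column, so the analyzer can resolve it after the join.
val final_df = src_df.alias("src")
  .join(ref_df.alias("ref"), src_df(col_name) === ref_df(col_name), "left_outer")
  .select(select_exprs.map(expr): _*)
Equivalently, .selectExpr(select_exprs: _*) accepts the same strings directly.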

Related

PySpark: how to get the count of records which do not match the given date format

I have a CSV file that contains (FileName, ColumnName, Rule and RuleDetails) as headers.
As per the RuleDetails I need to get the count of rows whose column (INSTALLDATE) does not match the RuleDetails date format.
I have to pass ColumnName and RuleDetails dynamically.
I tried with the below code:
from pyspark.sql.functions import *

DateFields = []
for rec in df_tabledef.collect():
    if rec["Rule"] == "DATEFORMAT":
        DateFields.append(rec["Columnname"])
        DateFormatValidvalues = [str(x) for x in rec["Ruledetails"].split(",") if x]
        DateFormatString = ",".join([str(elem) for elem in DateFormatValidvalues])

DateColsString = ",".join([str(elem) for elem in DateFields])

output = (
    df_tabledata.select(DateColsString)
    .where(
        DateColsString
        not in (datetime.strptime(DateColsString, DateFormatString), "DateFormatString")
    )
    .count()
)
display(output)
Expected output is the count of records which do not match the given date format.
For example, if 4 out of 10 records are not in (YYYY-MM-DD) format, then the count should be 4.
I get an error message when I run the above code.
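As a rough sketch of the underlying technique (written in Scala for consistency with the rest of this page; to_date and col also exist in pyspark.sql.functions): to_date returns null when a value does not match the given pattern, so counting those nulls counts the non-matching records. df_tabledata and INSTALLDATE come from the question, and the yyyy-MM-dd pattern stands in for the (YYYY-MM-DD) example.
import org.apache.spark.sql.functions.{col, to_date}

// Rows where INSTALLDATE is present but does not parse with the expected pattern.
val nonMatchingCount = df_tabledata
  .where(col("INSTALLDATE").isNotNull && to_date(col("INSTALLDATE"), "yyyy-MM-dd").isNull)
  .count()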

Saving Spark-SQL output as CSV

I am trying to save the output of Spark SQL to a path but I'm not sure what function to use. I want to do this without using Spark DataFrames. I was trying write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out") but was not successful. Can someone tell me how to do it?
Note: Spark SQL will give the output in multiple files. I need to ensure that the data is sorted globally across all the files (parts), so all words in part 0 come alphabetically before the words in part 1.
case class Docword(docId: Int, vocabId: Int, count: Int)
case class VocabWord(vocabId: Int, word: String)
// Read the input data
val docwords = spark.read.
  schema(Encoders.product[Docword].schema).
  option("delimiter", " ").
  csv("hdfs:///user/bdc_data/t3/docword.txt").
  as[Docword]
val vocab = spark.read.
  schema(Encoders.product[VocabWord].schema).
  option("delimiter", " ").
  csv("hdfs:///user/bdc_data/t3/vocab.txt").
  as[VocabWord]
docwords.createOrReplaceTempView("docwords")
vocab.createOrReplaceTempView("vocab")
spark.sql("""SELECT vocab.word AS word1, SUM(count) count1 FROM
docwords INNER JOIN vocab
ON docwords.vocabId = vocab.vocabId
GROUP BY word
ORDER BY count1 DESC""").show(10)
write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out")
// Required to exit the spark-shell
sys.exit(0)
.show() returns Unit (void); you should do something like below:
val writeDf = spark.sql("""SELECT vocab.word AS word1, SUM(count) count1 FROM
docwords INNER JOIN vocab
ON docwords.vocabId = vocab.vocabId
GROUP BY word
ORDER BY count1 DESC""")
writeDf.write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out")
writeDf.show() // this should not be used in prod environment
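To also satisfy the note about the output being sorted globally across the part files, one option (a sketch, assuming the alphabetical ordering by word is what matters) is to rely on orderBy, which performs a range-partitioned global sort, so the written part files follow that order:
// Globally sort by the word column before writing; part-00000 then holds the
// alphabetically first words, part-00001 the next range, and so on.
val sortedDf = writeDf.orderBy("word1")
sortedDf.write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out")

// Simpler, but only sensible for small output: force a single part file.
// sortedDf.coalesce(1).write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out")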

Fetching data from a List

I am new to PySpark and have purchased a book to enhance my PySpark skills. I am stuck while using a function.
Function
def filterDuplicates ( ( userID, ratings ) ):
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    return movie1 < movie2
I am getting an error due to the two consecutive parentheses. The step basically gets an RDD which is a list of tuples, as shown below:
[(196, ((242, 3.0), (242, 3.0))), (196, ((242, 3.0), (393, 4.0)))]
The final result should be only the distinct movie ID and rating pairs by each viewer.
So in the above example, 196 is the viewer ID, 242 is the movie ID and 3.0 is the rating given by the viewer.
Kindly advise if I need to download a different version of Python to use the double parentheses. Presently I have Python 3.7 installed on my machine.
Thanks,
AJ
The variable names inside a tuple parameter are of no use here: tuple parameter unpacking was removed in Python 3 (PEP 3113), so no Python 3.x version will accept that syntax. If you really want the tuple to be the parameter of the function, name the whole tuple and unpack it inside, like:
def filterDuplicates(userData):
    userId = userData[0]
    ratings = userData[1]
    movie1 = ratings[0][0]
    rating1 = ratings[0][1]
    movie2 = ratings[1][0]
    rating2 = ratings[1][1]
    return movie1 < movie2

How to get table names from SQL query?

I want to get all the table names from a SQL query in Spark using Scala.
Let's say a user sends a SQL query which looks like:
select * from table_1 as a left join table_2 as b on a.id=b.id
I would like to get the list of all the tables, like table_1 and table_2.
Is regex the only option?
Thanks a lot @Swapnil Chougule for the answer. That inspired me to offer an idiomatic way of collecting all the tables in a structured query.
scala> spark.version
res0: String = 2.3.1
def getTables(query: String): Seq[String] = {
  val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
  import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
  logicalPlan.collect { case r: UnresolvedRelation => r.tableName }
}
val query = "select * from table_1 as a left join table_2 as b on a.id=b.id"
scala> getTables(query).foreach(println)
table_1
table_2
Hope it will help you
Parse the given query using the Spark SQL parser (Spark internally does the same). You can get the sqlParser from the session's state; it gives you the logical plan of the query. Iterate over the logical plan of the query, check whether each node is an instance of UnresolvedRelation (a leaf logical operator representing a table reference in a logical query plan that has yet to be resolved), and get the table from it.
def getTables(query: String): Seq[String] = {
  val logical: LogicalPlan = localsparkSession.sessionState.sqlParser.parsePlan(query)
  val tables = scala.collection.mutable.LinkedHashSet.empty[String]
  var i = 0
  while (true) {
    if (logical(i) == null) {
      return tables.toSeq
    } else if (logical(i).isInstanceOf[UnresolvedRelation]) {
      val tableIdentifier = logical(i).asInstanceOf[UnresolvedRelation].tableIdentifier
      tables += tableIdentifier.unquotedString.toLowerCase
    }
    i = i + 1
  }
  tables.toSeq
}
I had some complicated SQL queries with nested queries and iterated on @Jacek Laskowski's answer to get this:
def getTables(spark: SparkSession, query: String): Seq[String] = {
  val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
  var tables = new ListBuffer[String]()
  var i: Int = 0
  while (logicalPlan(i) != null) {
    logicalPlan(i) match {
      case t: UnresolvedRelation => tables += t.tableName
      case _ =>
    }
    i += 1
  }
  tables.toList
}
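A quick usage sketch for the function above; the SparkSession setup and the nested query are illustrative, and the function itself additionally needs imports for SparkSession, UnresolvedRelation and ListBuffer:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("get-tables").master("local[*]").getOrCreate()

// A nested query: the sub-select and the outer join are both walked by the plan traversal.
val nestedQuery =
  """SELECT a.id, b.total
    |FROM table_1 a
    |LEFT JOIN (SELECT id, COUNT(*) AS total FROM table_2 GROUP BY id) b
    |ON a.id = b.id""".stripMargin

getTables(spark, nestedQuery).foreach(println)
// expected output: table_1 and table_2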
def __sqlparse2table(self, query):
    '''
    #description: get table name from table
    '''
    plan = self.spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
    plan_string = plan.toString().replace('`.`', '.')
    unr = re.findall(r"UnresolvedRelation `(.*?)`", plan_string)
    cte = re.findall(r"CTE \[(.*?)\]", plan.toString())
    cte = [tt.strip() for tt in cte[0].split(',')] if cte else cte
    schema = set()
    tables = set()
    for table_name in unr:
        if table_name not in cte:
            schema.update([table_name.split('.')[0]])
            tables.update([table_name])
    return schema, tables
Since you need to list all the column names in table1 and table2, what you can do is run show tables in db.table_name in your Hive db.
val tbl_column1 = sqlContext.sql("show tables in table1");
val tbl_column2 = sqlContext.sql("show tables in table2");
You will get the list of columns in both the tables.
tbl_column1.show
name
id
data
Unix did the trick:
grep 'INTO\|FROM\|JOIN' *.sql | sed -r 's/.*?(FROM|INTO|JOIN)\s*?([^ ]*).*/\2/g' | sort -u
grep 'overwrite table' *.txt | sed -r 's/.*?(overwrite table)\s*?([^ ]*).*/\2/g' | sort -u

How to pass where clause as variable to the query in Spark SQL?

Sample Table:
+---------------------------------------------------------++------+
| Name_Age || ID |
+---------------------------------------------------------++------+
|"SUBHAJIT SEN":28,"BINOY MONDAL":26,"SHANTANU DUTTA":35 || 15 |
|"GOBINATHAN SP":35,"HARSH GUPTA":27,"RAHUL ANAND":26 || 16 |
+---------------------------------------------------------++------+
How to pass WHERE clause as variable to the query?
My desired query is: Select Name_Age from table where ID=15, so the where variable is ID=15.
If the data is already registered as a table (a Hive table, or after calling registerTempTable on a DataFrame), you can use SQLContext.sql:
val whereClause: String = "ID=15"
sqlContext.sql("Select Name_Age from table where " + whereClause)
If you have a df: DataFrame object you want to query:
// using a string filter:
df.filter(whereClause).select("Name_Age")
The below code can get you your answer:
spark.sql("""Select Name_Age from table where ID='15'""")