Saving a Spark-SQL file as csv - scala

I am trying to save the output of Spark SQL to a path, but I am not sure which function to use. I want to do this without using Spark DataFrames directly. I tried write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out") but was not successful. Can someone tell me how to do it?
Note: Spark SQL will write the output as multiple files (parts). I need to ensure that the data is sorted globally across all the part files, so that all words in part 0 come alphabetically before the words in part 1.
case class Docword(docId: Int, vocabId: Int, count: Int)
case class VocabWord(vocabId: Int, word: String)
// Read the input data
val docwords = spark.read.
schema(Encoders.product[Docword].schema).
option("delimiter", " ").
csv("hdfs:///user/bdc_data/t3/docword.txt").
as[Docword]
val vocab = spark.read.
schema(Encoders.product[VocabWord].schema).
option("delimiter", " ").
csv("hdfs:///user/bdc_data/t3/vocab.txt").
as[VocabWord]
docwords.createOrReplaceTempView("docwords")
vocab.createOrReplaceTempView("vocab")
spark.sql("""SELECT vocab.word AS word1, SUM(count) count1 FROM
docwords INNER JOIN vocab
ON docwords.vocabId = vocab.vocabId
GROUP BY word
ORDER BY count1 DESC""").show(10)
write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out")
// Required to exit the spark-shell
sys.exit(0)

.show() returns Unit, so you should do something like below:
val writeDf = spark.sql("""SELECT vocab.word AS word1, SUM(count) count1 FROM
docwords INNER JOIN vocab
ON docwords.vocabId = vocab.vocabId
GROUP BY word
ORDER BY count1 DESC""")
writeDf.write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out")
writeDf.show() // this should not be used in prod environment
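On the note about global ordering across the part files: an ORDER BY before the write performs a global sort, so each part file holds a contiguous slice of the sorted output (everything in part 0 sorts before part 1, and so on). If alphabetical order by word is what is required, a minimal sketch of the same answer with the sort key changed (assuming the docwords/vocab views registered above):
val sortedDf = spark.sql("""SELECT vocab.word AS word1, SUM(count) AS count1 FROM
docwords INNER JOIN vocab
ON docwords.vocabId = vocab.vocabId
GROUP BY word
ORDER BY word1""") // ORDER BY is a global sort, so the written parts are ordered alphabetically
sortedDf.write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out")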

Related

how to pass the case statements into .selectExpr("*", "all case statements") spark code, from an external file

I have the below case statements in a sql file.
Note: it is just a sample and I saved it as col_sql.sql
"CASE WHEN a = 1 THEN ONE END AS INT_VAL"
, "CASE WHEN a = 'DE' THEN 'APHABET' AS STR_VAL"
In the Spark Scala code, I'm reading col_sql.sql as per below:
val col_file = "dir/path/col_sql.sql"
val col_query = readFile(col_file) // internally converted to a String using .mkString
Then I pass it to the select in my Spark code:
.selectExpr("*", col_query)
Expectation:
My expectation is that when my Spark job runs, the case statements are passed to the .selectExpr() function exactly as they are given in the sql file, like this:
.selectExpr("*", "CASE WHEN a = 1 THEN ONE END AS INT_VAL", "CASE WHEN a = 'DE' THEN 'APHABET' END AS STR_VAL")
When running it manually in spark2-shell it works correctly, but in the spark2-submit job it throws a ParseDriver error.
Kindly assist me on this.
Each argument in selectExpr should resolve to one column (see examples in the doc). In this case you will have to split the expression read from the file, e.g.:
// Example given the complete string, you could split already when reading the file
val col_query = "\"CASE WHEN a = 1 THEN ONE END AS INT_VAL\", \"CASE WHEN a = 'DE' THEN 'APHABET' AS STR_VAL\""
val cols_queries = col_query.split(",").map(x => x.trim().stripPrefix("\"").stripSuffix("\""))
df.selectExpr("*", cols_queries: _*) // to expand the list into arguments

How to execute hql file in spark with arguments

I have an hql file which accepts several arguments, and then in a standalone Spark application I am calling this hql script to create a dataframe.
This is a sample hql code from my script:
select id , name, age, country , created_date
from ${db1}.${table1} a
inner join ${db2}.${table2} b
on a.id = b.id
And this is how I am calling it in my Spark script:
import scala.io.Source
val queryFile = "path/to/my/file"
val db1 = "cust_db"
val db2 = "cust_db2"
val table1 = "customer"
val table2 = "products"
val query = Source.fromFile(queryFile).mkString
val df = spark.sql(query)
When I run it this way, I am getting:
org.apache.spark.sql.catalyst.parser.ParseException
Is there a way to pass arguments directly to my hql file and then create a DataFrame out of the Hive code?
Parameters can be injected with code like this:
val parametersMap = Map("db1" -> db1, "db2" -> db2, "table1" -> table1, "table2" -> table2)
val injectedQuery = parametersMap.foldLeft(query)((acc, cur) => acc.replace("${" + cur._1 + "}", cur._2))
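For completeness, a short usage sketch continuing from the snippet above, with the values from the question, just to show what the substituted query looks like before it is handed to spark.sql:
// With db1 = "cust_db", db2 = "cust_db2", table1 = "customer", table2 = "products",
// injectedQuery now reads:
//   select id , name, age, country , created_date
//   from cust_db.customer a
//   inner join cust_db2.products b
//   on a.id = b.id
val df = spark.sql(injectedQuery)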

Dynamically framing column names to pass it for select while creating Spark Data frame

I have been working in Spark and Scala for the past 2 months and I'm new to this technology. I framed the select columns (with regexp_replace) as a List[String]() and passed it for the Spark DataFrame creation, and it's throwing a "Cannot resolve" error. Please find below the steps I have followed and tried.
Defining the vals:
The column which I would like to identify in the src data frame:
val col_name = "region_id"
The column from the ref data frame which will be used to replace the src data frame column:
val surr_key_col_name = "surrogate_key"
I have created two DataFrames as shown below:
src_df
region_id | region_name | region_code
10001189 | Spain | SP09 8545
10001765 | Africa | AF97 6754
ref_df
region_id | surrogate_key
1189 | 2345
1765 | 8978
val src_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("s3://bucket/src_details.csv")
val ref_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("s3://bucket/ref_details.csv")
I'm iterating through the headers to identify the column where I need to use a regex match and replace it with the other DataFrame's column value, storing the result in a List to pass to the DataFrame's select:
val src_header_rec = src_df.columns.toList
var src_column_names = List[String]() // collects the select expressions
//Loop through source file header to identify the region_id and replace it with surrogate_id by doing a pattern match( I don't want to replace the
for (src_header_cols <- src_header_rec) {
if (col_name == src_header_cols) {
src_column_names :+="regexp_replace("+"$"+s""""src.$src_header_cols""""+","+"$"+s""""ref.$src_header_cols""""+","+"$"+s""""ref.$surr_key_col_name""""+")"+".as("+s""""$src_header_cols""""+")"
}
else {
src_column_names :+= "src."+src_header_cols
}
}
After building the select columns in the List[String]() using the for loop above, I'm passing them to the select for the final_df creation:
val final_df = src_df.alias("src").join(ref_df.alias("ref"), src_df(col_name)=== ref_df(col_name),"left_outer").select(src_column_names.head,src_column_names.tail:_*)
If I directly pass the columns in the DataFrame's select, without using the List[String](), my regexp_replace substitution works:
val final_df = src_df.alias("src").join(ref_df.alias("ref"), src_df(col_name)=== ref_df(col_name),"left_outer").select(regexp_replace($"src.region_id",$"ref.region_id",$"ref.surrogate_key").as("region_id"))
I'm not sure why it's not working when I'm passing it as a List[String]().
When I remove the regexp_replace substitution in the for loop and pass the list to the DataFrame select, it works properly as shown below.
This code works fine with the DataFrame select:
for (src_header_cols <- src_header_rec) {
if (col_name == src_header_cols) {
src_column_names :+= "ref."+surr_key_col_name
}
else {
src_column_names :+= "src."+src_header_cols
}
}
val final_df = src_df.alias("src").join(ref_df.alias("ref"), src_df(col_name)===ref_df(col_name),"left_outer").select(src_column_names.head,src_column_names.tail:_*)
The result/output DataFrame I'm trying to derive is:
final_df
region_id | region_name | region_code
10002345 | Spain | SP09 8545
10008978 | Africa | AF97 6754
(the trailing 1189/1765 of region_id is replaced by the surrogate_key values 2345/8978)
So, when I try to build the Spark DataFrame select in the for loop with regexp_replace as a List and use it, it throws the "Cannot resolve" error.
I have tried creating temporary views of the DataFrames and using the same regexp_replace in a select statement over the temporary views, and it worked. Please find below the code I tried.
// This for loop scans through my header list; whichever column matches gets a regexp_replace expression framed for it. So the region_id from the DataFrame header matches the variable value that I have defined.
for (src_header_cols <- src_header_rec) {
if (col_name == src_header_cols) {
src_column_names :+= "regexp_replace(src."+s"$src_header_cols"+",ref."+s"$ref_col_name"+",ref."+s"$surr_key_col_name"+")"+s" $src_header_cols"
}
else {
src_column_names :+= "src."+src_header_cols
}
}
//Creating Temporary view to apply SQL queries on it
src_df.createOrReplaceTempView("src")
ref_df.createOrReplaceTempView("ref")
//Framing SQL statements to be passed while querying
val selectExpr_1 = "select "+src_column_names.mkString(",")
val selectExpr_2 = selectExpr_1+" from src left outer join ref on(src."+s"$col_name"+" = ref."+s"$ref_col_name"+")"
// Creating a final Data Frame using the SQL statement created
val src_policy_masked_df = spark.sql(s"$selectExpr_2")
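For reference, with the sample columns above (and assuming ref_col_name is "region_id", since its definition is not shown in the snippet), the generated statement looks roughly like the query in the comment below, and the masked frame can be checked with show:
// Assuming ref_col_name = "region_id" (not defined in the snippet above),
// selectExpr_2 resolves to roughly:
//   select regexp_replace(src.region_id,ref.region_id,ref.surrogate_key) region_id,
//          src.region_name,src.region_code
//   from src left outer join ref on(src.region_id = ref.region_id)
src_policy_masked_df.show() // e.g. 10001189 -> 10002345, 10001765 -> 10008978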

SPARK SQL: Implement AND condition inside a CASE statement

I am aware of how to implement a simple CASE-WHEN-THEN clause in Spark SQL using Scala. I am using version 1.6.2. But I need to specify an AND condition on multiple columns inside the CASE-WHEN clause. How do I achieve this in Spark using Scala?
Thanks in advance for your time and help!
Here's the SQL query that I have:
select sd.standardizationId,
case when sd.numberOfShares = 0 and
isnull(sd.derivatives,0) = 0 and
sd.holdingTypeId not in (3,10)
then
8
else
holdingTypeId
end
as holdingTypeId
from sd;
First read the table as a dataframe:
val table = sqlContext.table("sd")
Then select with an expression; align the syntax to your database:
val result = table.selectExpr("standardizationId","case when numberOfShares = 0 and isnull(derivatives,0) = 0 and holdingTypeId not in (3,10) then 8 else holdingTypeId end as holdingTypeId")
And show the result:
result.show
An alternative option, if you want to avoid the full string expression, is the following:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val sd = sqlContext.table("sd")
val conditionedColumn: Column = when(
(sd("numberOfShares") === 0) and
(coalesce(sd("derivatives"), lit(0)) === 0) and
(!sd("holdingTypeId").isin(Seq(3,10): _*)), 8
).otherwise(sd("holdingTypeId")).as("holdingTypeId")
val result = sd.select(sd("standardizationId"), conditionedColumn)
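A brief usage sketch with the same imports as above: the && operator on Column is interchangeable with and here, so the condition can also be written as below and displayed with show, mirroring the first option.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{when, coalesce, lit}

// Same condition using &&; when/otherwise is unchanged
val conditionAlt: Column =
  (sd("numberOfShares") === 0) &&
  (coalesce(sd("derivatives"), lit(0)) === 0) &&
  !sd("holdingTypeId").isin(Seq(3, 10): _*)

sd.select(sd("standardizationId"), when(conditionAlt, 8).otherwise(sd("holdingTypeId")).as("holdingTypeId")).show()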

Column name cannot be resolved in SparkSQL join

I'm not sure why this is happening. In PySpark, I read in two dataframes and print out their column names, and they are as expected, but then when I do a SQL join I get an error that it cannot resolve a column name given the inputs. I have simplified the merge just to get it to work, but I will need to add in more join conditions, which is why I'm using SQL (I will be adding: "and b.mnvr_bgn < a.idx_trip_id and b.mnvr_end > a.idx_trip_data"). It appears that the column 'device_id' is being renamed to '_col7' in the df mnvr_temp_idx_prev_temp.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
print mnvr_temp_idx_prev.columns
['device_id', 'mnvr_bgn', 'mnvr_end']
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
print raw_data_filtered.columns
['device_id', 'trip_id', 'idx_trip_end']
raw_data_filtered.registerTempTable('raw_data_filtered_temp')
mnvr_temp_idx_prev.registerTempTable('mnvr_temp_idx_prev_temp')
test = sqlContext.sql('SELECT a.device_id, a.idx_trip_end, b.mnvr_bgn, b.mnvr_end \
FROM raw_data_filtered_temp as a \
INNER JOIN mnvr_temp_idx_prev_temp as b \
ON a.device_id = b.device_id')
Traceback (most recent call last): AnalysisException: u"cannot resolve 'b.device_id' given input columns: [_col7, trip_id, device_id, mnvr_end, mnvr_bgn, idx_trip_end]; line 1 pos 237"
Any help is appreciated!
I would recommend renaming the field 'device_id' in at least one of the data frames. I modified your query just a bit and tested it (in Scala). The below query works:
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device_id")
[device_id: string, mnvr_bgn: string, mnvr_end: string, device_id: string, trip_id: string, idx_trip_end: string]
Now if you do a 'select *' in the above statement, it will work. But if you try to select 'device_id', you will get a "Reference 'device_id' is ambiguous" error. As you can see in the 'test' data frame definition above, it has two fields with the same name (device_id). So to avoid this, I recommend changing the field name in one of the dataframes:
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end') \
    .withColumnRenamed("device_id", "device")
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
Now use DataFrames or the sqlContext:
//using dataframes with multiple conditions
val test = mnvr_temp_idx_prev.join(raw_data_filtered,$"device" === $"device_id"
&& $"mnvr_bgn" < $"idx_trip_id","inner")
//in SQL Context
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device and b.mnvr_bgn < a.idx_trip_id")
The above queries will work for your problem. If your data set is large, I would recommend not relying on '>' or '<' operators alone in the join condition, since an inequality-only join degenerates into a costly cross-join-like operation; keep the equality in the join and apply the range checks in the WHERE condition.
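As a minimal Scala sketch of that recommendation (assuming the renamed 'device' column from above, the implicits for the $ syntax in scope, and the range columns named as in the question, to be adjusted to the real schema): keep the equality in the join and push the range checks into a filter/WHERE.
// Equi-join on the device id; range conditions applied afterwards as a filter
val test = raw_data_filtered.as("a")
  .join(mnvr_temp_idx_prev.as("b"), $"a.device_id" === $"b.device", "inner")
  .where(($"b.mnvr_bgn" < $"a.idx_trip_id") && ($"b.mnvr_end" > $"a.idx_trip_data"))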