unable to loop hive sql queries in spark.sql - scala

I am facing challenges with the code below, which is supposed to loop over Hive SQL queries stored in a table and run each one through spark.sql.
def missing_pks(query: String) = {
  //println(f"spark.sql( $query )")
  spark.sql(query)
}

var hql_query_list_df = spark.sql("select distinct hql_qry from table where msr_nm='orders' and rgn_src='europe'")
var hql = hql_query_list_df.select('hql_qry).as[String].collect()
var hql_f = hql_query_list_df.map( "\"" + _ + "\"" )
hql_f.foreach(missing_pks)
Here I read the Hive SQL statements from a table, load them into a list, and then try to execute them, but unfortunately it is not working, and I'm not sure what is missing in my code. The interesting part is that if the list is created manually within spark-shell, the code works perfectly. It would be great if someone could help me here.
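For what it's worth, a minimal sketch of looping over the collected query strings directly, without the extra quoting step (mapping over the DataFrame wraps each Row's toString in literal quote characters, which produces invalid SQL). The table and column names are the ones from the question:
import spark.implicits._
// Pull the query strings back to the driver as plain strings.
val queries: Array[String] = spark
  .sql("select distinct hql_qry from table where msr_nm='orders' and rgn_src='europe'")
  .select("hql_qry")
  .as[String]
  .collect()
// Run each Hive SQL statement one at a time on the driver.
queries.foreach { q =>
  // println(s"spark.sql( $q )")
  spark.sql(q)
}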

Related

refreshing materialized views in quill

How do I execute a query that looks like this using the Scala Quill library?
REFRESH MATERIALIZED VIEW CONCURRENTLY transaction_view
Basically this can be achieved by writing it in a quote:
val q = quote { query[MyTable] }
val myQuery = quote { infix"REFRESH MATERIALIZED VIEW CONCURRENTLY {$q}".as[Query[MyTable]] }
Thanks to #deusaquilus
The answer from binkabir is almost correct. One last touch is needed: replace Query with Action, otherwise Quill will again generate a SELECT instead of just using the raw string.
val q = quote { query[MyTable] }
val myQuery = quote { infix"REFRESH MATERIALIZED VIEW CONCURRENTLY {$q}".as[Action[MyTable]] }
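For completeness, a minimal sketch of how the quoted action might be run, assuming a quill-jdbc PostgresJdbcContext; the context construction, the "ctx" config prefix and the MyTable case class below are assumptions, not part of the original answer:
import io.getquill._
// Hypothetical table mapping; replace with your actual case class.
case class MyTable(id: Int)
// Assumes a "ctx" section in application.conf with the Postgres connection details.
lazy val ctx = new PostgresJdbcContext(SnakeCase, "ctx")
import ctx._
val q = quote { query[MyTable] }
val refresh = quote { infix"REFRESH MATERIALIZED VIEW CONCURRENTLY {$q}".as[Action[MyTable]] }
// Sends the raw REFRESH statement to the database.
ctx.run(refresh)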

How to process multiple parquet files individually in a for loop?

I have multiple parquet files (around 1000). I need to load each one of them, process it and save the result to a Hive table. I have a for loop, but it only seems to work with 2 or 5 files, not with 1000; it seems Spark tries to load them all at the same time, and I need it to process them individually in the same Spark session.
I tried using a for loop, then a foreach, and I used unpersist(), but it fails anyway.
val ids = get_files_IDs()

ids.foreach(id => {
  println("Starting file " + id)
  val df = load_file(id)
  val values_df = calculate_values(df)
  values_df.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
  df.unpersist()
})

// Reads the list of file IDs to process from the Hive table.
def get_files_IDs(): List[String] = {
  val ids = sqlContext.sql("SELECT CAST(id AS varchar(10)) FROM table.ids WHERE id IS NOT NULL")
  ids.select("id").map(r => r.getString(0)).collect().toList
}

// Aggregates the values of one file by id, date and hour.
def calculate_values(df: org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame = {
  df.groupBy($"id", $"date", $"hr_time").agg(avg($"value_a") as "avg_val_a", avg($"value_b") as "avg_value_b")
}

// Loads the parquet file that corresponds to a single ID.
def load_file(id: String): org.apache.spark.sql.DataFrame = {
  sqlContext.read.parquet("/user/hive/wh/table.db/parquet/values_for_" + id + ".parquet")
}
What I would expect is for Spark to load file ID 1, process the data, save it to the Hive table, then discard that data and continue with the second ID, and so on until it finishes the 1000 files, instead of trying to load everything at the same time.
Any help would be very much appreciated! I've been stuck on this for days. I'm using Spark 1.6 with Scala. Thank you!!
EDIT: Added the definitions. Hope they help give a better view. Thank you!
OK, so after a lot of inspection I realised that the process was working fine. It processed each file individually and saved the results. The issue was that in some very specific cases the process was taking far too long.
So I can tell that with a for loop or a foreach you can process multiple files and save the results without problems. Unpersisting and clearing the cache does help with performance.
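For reference, a minimal sketch of the loop with that cache housekeeping added (sqlContext.clearCache() is my addition; whether it is needed depends on how much gets cached per iteration):
ids.foreach(id => {
  println("Starting file " + id)
  val df = load_file(id)
  val values_df = calculate_values(df)
  values_df.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
  // Release anything cached for this file before moving on to the next one.
  df.unpersist()
  sqlContext.clearCache()
  println("Finished file " + id)
})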

create table in phoenix from spark

Hi, I need to create a table in Phoenix from a Spark job. I have tried the 2 ways below but neither of them works; it seems this is still not supported.
1) DataFrame.write still requires that the table exists previously:
df.write.format("org.apache.phoenix.spark").mode("overwrite").option("table", schemaName.toUpperCase + "." + tableName.toUpperCase ).option("zkUrl", hbaseQuorum).save()
2) If we connect to Phoenix through JDBC and try to execute the CREATE statement, we get a parsing error (the same CREATE works in Phoenix):
var ddlCode="create table test (mykey integer not null primary key, mycolumn varchar) "
val driver = "org.apache.phoenix.jdbc.PhoenixDriver"
val jdbcConnProps = new Properties()
jdbcConnProps.setProperty("driver", driver);
val jdbcConnString = "jdbc:phoenix:hostname:2181/hbase-unsecure"
sqlContext.read.jdbc(jdbcConnString, ddlCode, jdbcConnProps)
error:
org.apache.phoenix.exception.PhoenixParserException: ERROR 601 (42P00): Syntax error. Encountered "create" at line 1, column 15.
Has anyone with similar challenges managed to do it differently?
I have finally worked out a solution for this. Basically, I think I was wrong in trying to use the SQLContext read method for this; I believe that method is designed just to "read" data sources. The way to work around it was basically to open a standard JDBC connection against Phoenix:
import java.sql.{Connection, DriverManager}

val ddlCode = "create table test (mykey integer not null primary key, mycolumn varchar) "
val driver = "org.apache.phoenix.jdbc.PhoenixDriver"
val jdbcConnString = "jdbc:phoenix:hostname:2181/hbase-unsecure"
val user = "USER"
val pass = "PASS"

Class.forName(driver)
val connection: Connection = DriverManager.getConnection(jdbcConnString, user, pass)
val statement = connection.createStatement()
statement.executeUpdate(ddlCode)
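As a follow-up (not part of the original answer): once the table exists, the DataFrame write from attempt (1) should go through. A sketch reusing the names from the question, and remembering to close the JDBC resources:
// Close the DDL resources once the table has been created.
statement.close()
connection.close()
// With the table in place, the Phoenix-Spark writer from attempt (1) can be used.
df.write
  .format("org.apache.phoenix.spark")
  .mode("overwrite")
  .option("table", schemaName.toUpperCase + "." + tableName.toUpperCase)
  .option("zkUrl", hbaseQuorum)
  .save()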

sqlite3 results set in swift returning extra data

I am trying to fetch a number of records from a SQLite3 database and load them into an array. The code I have written, which seems to function correctly, at least as far as retrieving the correct number of records with the right values from the db, is:
while(results?.next() == true) {
    println("Got a result")
    var sname = results?.stringForColumn("surname")
    var fname = results?.stringForColumn("firstname")
    println("Retrieved \(sname) ,\(fname)")
}
The problem I have is that when I try to access the variables in the println statement, what it yields is:
Retrieved Optional("Smiles") ,Optional("Dick")
I have seemingly tried everything to get just the values but keep getting the Optional(...) wrapper added. Any ideas?
Try this approach, force-unwrapping the optionals:
println("Retrieved \(sname!) ,\(fname!)")

Reading sql file using getResources in scala

I'm trying to read and execute a SQL file in Spark SQL:
sqlContext.sql(
  scala.io.Source.fromInputStream(getClass.getResourceAsStream("/" + "dq.sql"))
    .getLines.mkString(" ").stripMargin
).take(1)
My SQL is very long. When I run it directly in spark-shell, it runs fine. When I try to read it using getResourceAsStream, I'm hitting:
java.lang.RuntimeException: [1.10930] failure: end of input
A simple solution could be to read the SQL at the driver (using any file utility) and pass the variable along, e.g. ssc.sql(sqlvar):
import java.io.InputStream

val stream: InputStream = getClass.getResourceAsStream("/filename.txt")
val readFile = scala.io.Source.fromInputStream(stream).getLines
val spa = readFile.map(line => " " + line)
val spl = spa.mkString.split(";")
for (m1 <- spl) {
  sqlContext.sql(m1)
}
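A small defensive tweak on the same idea (my addition, not part of the original answer): trim each fragment and skip empty ones, so blank lines or a stray trailing semicolon don't send an empty statement to sqlContext.sql:
for (stmt <- spl.map(_.trim).filter(_.nonEmpty)) {
  sqlContext.sql(stmt)
}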