How to execute hql files with multiple SQL queries per single file? - scala

I have an hql file which has a lot of Hive queries and I would like to execute the whole file using Spark SQL.
This is what I have tried.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Usually to execute individual queries we do it this way:
sqlContext.sql("SELECT * from table")
However, when we have an hql file with hundreds of queries, I usually do it like this:
import scala.io.Source
val filename = "/path/to/file/filename.hql"
for (line <- Source.fromFile(filename).getLines) {
  sqlContext.sql(line)
}
However, I get an error saying:
NoViableAltException
This is the top of the file.
DROP TABLE dat_app_12.12_app_htf;
CREATE EXTERNAL TABLE dat_app_12.12_app_htf(stc string,
ftg string,
product_type string,
prod_number string,
prod_ident_number string,
prod_family string,
frst_reg_date date, gth_name string,
address string,
tel string,
maker_name string) ROW format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
stored AS inputformat 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'file_location';
When the queries are multi-line queries like the above one, it doesn't work.
However, when I format the queries and put all the lines into one line, it works.
CREATE EXTERNAL TABLE dat_app_12.12_app_htf(stc string, ftg string, product_type string, prod_number string, prod_ident_number string, prod_family string, frst_reg_date date, gth_name string, address string, tel string, maker_name string) ROW format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' stored AS inputformat 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION 'file_location';
But I have thousands of lines like this. What is the proper way to do it?
Can anyone help me get this solved?

tl;dr I don't think it's possible.
Spark SQL uses AstBuilder as the ANTLR-based SQL parser and accepts a single SQL statement at a time (see SqlBase.g4 for the full coverage of all supported SQL queries).
With that said, the only way to do it is to parse the multi-query input file yourself before calling Spark SQL's sqlContext.sql (or spark.sql as of Spark 2.0).
You could rely on empty lines as separators perhaps, but that depends on how input files are structured (and they could easily use semicolon instead).
In your particular case I've noticed that the end markers are actually semicolons.
// one query that ends with semicolon
DROP TABLE dat_app_12.12_app_htf;
// another query that also ends with semicolon
CREATE EXTERNAL TABLE dat_app_12.12_app_htf(stc string,
ftg string,
product_type string,
prod_number string,
prod_ident_number string,
prod_family string,
frst_reg_date date, gth_name string,
address string,
tel string,
maker_name string) ROW format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
stored AS inputformat 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'file_location';
If that's consistent, you could parse the file line by line (as you do with the for expression) and accumulate lines until a ; is found. Multi-line SQL queries are fine for Spark SQL, so you should have your solution.
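For example, here is a minimal sketch of that idea, assuming semicolons only ever appear as statement terminators (never inside string literals or comments), which holds for the file shown above:
import scala.io.Source

val filename = "/path/to/file/filename.hql"
val fileContents = Source.fromFile(filename).mkString

fileContents
  .split(";")                  // naive split on the statement terminator
  .map(_.trim)
  .filter(_.nonEmpty)          // skip the empty fragment after the last ;
  .foreach { statement =>
    sqlContext.sql(statement)  // multi-line statements parse fine
  }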
I had a similar use case in a project and simply gave up trying to parse all the possible ways people write SQLs before I could hand it over to Spark.

Hey, did you try this command?
spark-sql --master yarn-client --conf spark.ui.port=port -i /hadoop/sbscript/hql_for_dml.hql

Related

How to get the SQL representation for the query logic of a (derived) Spark DataFrame?

One can convert a raw SQL string into a DataFrame. But is it also possible the other way around, i.e., get the SQL representation for the query logic of a (derived) Spark DataFrame?
import org.apache.spark.sql.functions.count
import currentSparkSession.implicits._

// Source data
val a = Seq(7, 8, 9, 7, 8, 7).toDF("foo")
// Query using DataFrame functions
val b = a.groupBy($"foo").agg(count("*") as "occurrences").orderBy($"occurrences")
b.show()
// Convert a SQL string into a DataFrame
val sqlString = "SELECT foo, count(*) as occurrences FROM a GROUP BY foo ORDER BY occurrences"
a.createOrReplaceTempView("a")
val c = currentSparkSession.sql(sqlString)
c.show()
// "Convert" a DataFrame into a SQL string
b.toSQLString() // Error: this method does not exist
It is not possible to "convert" a DataFrame into an SQL string because Spark does not know how to write SQL queries and it does not need to.
I find it useful to recall how DataFrame code or an SQL query gets handled by Spark. This is done by Spark's Catalyst optimizer, which takes it through four transformational phases: analysis, logical optimization, physical planning, and code generation.
In the first phase (analysis), the Spark SQL engine generates an abstract syntax tree (AST) for the SQL or DataFrame query. This tree is the main data type in Catalyst (see section 4.1 in the white paper Spark SQL: Relational Data Processing in Spark) and it is used to create the logical plan and eventually the physical plan. You get a representation of those plans if you use the explain API that Spark offers.
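For example, the plans Catalyst produced for the DataFrame b from the question can be inspected like this (illustrative):
// Prints the parsed, analyzed and optimized logical plans as well as the physical plan.
b.explain(true)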
Although it is clear to me what you mean with "One can convert a raw SQL string into a DataFrame", I think it helps to be more precise. We are not really converting an SQL string (hence you put quotation marks around that word yourself) into a DataFrame; rather, you applied your SQL knowledge, because SQL is a syntax Spark can parse to understand your intentions. In addition, you cannot just type in any SQL query, as it could still fail in the analysis phase when it is checked against the catalog. So the SQL string is just an agreement on how Spark allows you to give it instructions. The SQL query gets parsed, transformed into an AST (as described above), and after going through the other three phases it ends up as RDD-based code. The result of this SQL execution through the sql API is a DataFrame, which you can easily transform into an RDD with df.rdd.
Overall, there is no need for Spark to translate any code, and in particular any DataFrame code, into SQL syntax that you could then get out of Spark. The AST is the internal abstraction, and Spark converts DataFrame code directly into an AST rather than going through an SQL query first.
No. There is no method that can get the SQL query from a DataFrame.
You will have to create the query yourself by looking at all the filters and selects you used to create the DataFrame.

Spark/Scala - Validate JSON document in a row of a streaming DataFrame

I have a streaming application which is processing a streaming DataFrame with column "body" that contains a JSON string.
So the body contains something like this (these are four input rows):
{"id":1, "ts":1557994974, "details":[{"id":1,"attr2":3,"attr3":"something"}, {"id":2,"attr2":3,"attr3":"something"}]}
{"id":2, "ts":1557994975, "details":[{"id":1,"attr2":"3","attr3":"something"}, {"id":2,"attr2":"3","attr3":"something"},{"id":3,"attr2":"3","attr3":"something"}]}
{"id":3, "ts":1557994976, "details":[{"id":1,"attr2":3,"attr3":"something"}, {"id":2,"attr2":3}]}
{"id":4, "ts":1557994977, "details":[]}
I would like to check that each row has the correct schema (data types and contains all attributes). I would like to filter out and log the invalid records somewhere (like a Parquet file). I am especially interested in the "details" array - each of the nested documents must have specified fields and correct data types.
So in the example above only row id = 1 is valid.
I was thinking about a case class such as:
case class Detail(
  id: Int,
  attr2: Int,
  attr3: String
)

case class Input(
  id: Int,
  ts: Long,
  details: Seq[Detail]
)
and using Try, but I'm not sure how to go about it.
Could someone help, please?
Thanks
One approach is to use JSON Schema, which can help you with schema validation on the data. The getting started page is a good place to start if you're new to it.
The other approach would roughly work as follows:
Build models (case classes) for each of the objects, like you've attempted in your question.
Use a JSON library like spray-json or Play JSON to parse the input JSON.
Any input that fails to parse into a valid record is most likely invalid, and you can partition that output into a different sink in your Spark code. It would also make this more robust to have an isValid method on the objects that can check whether a parsed record is correct (see the sketch below).
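A minimal sketch, assuming spray-json and the case classes from the question (parseBody and isValid are illustrative names, not library APIs):
import scala.util.Try
import spray.json._

object InputJsonProtocol extends DefaultJsonProtocol {
  implicit val detailFormat: RootJsonFormat[Detail] = jsonFormat3(Detail)
  implicit val inputFormat: RootJsonFormat[Input] = jsonFormat3(Input)
}
import InputJsonProtocol._

// Returns Some(Input) when the row parses with the expected fields and types,
// None otherwise (e.g. attr2 given as a string, or a missing attr3).
def parseBody(body: String): Option[Input] =
  Try(body.parseJson.convertTo[Input]).toOption

// Extra business rules (such as requiring a non-empty details array, which is
// why only id = 1 is valid in the example) can go into an isValid check:
def isValid(input: Input): Boolean = input.details.nonEmpty

// Rows where parseBody returns None or isValid is false can then be routed to
// a separate sink (for example a Parquet file of rejected records).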
The easiest way for me is to create a DataFrame with a schema and then filter with id == 1. This is not the most efficient way, though.
Here you can find an example of creating a DataFrame with a schema: https://blog.antlypls.com/blog/2016/01/30/processing-json-data-with-sparksql/
Edit
I can't find a way to pre-filter to speed up the JSON search in Scala, but you can use one of these three options:
spark.read.schema(mySchema).format("json").load("myPath").filter($"id" === 1)
or
spark.read.schema(mySchema).json("myPath").filter($"id" === 1)
or
spark.read.json("myPath").filter($"id" === 1)

How to read Hive table with column with JSON strings?

I have a Hive table column (Json_String String) that has some 1000 rows, where each row is a JSON string of the same structure. I am trying to read the JSON into a DataFrame as below:
val df = sqlContext.read.json("select Json_String from json_table")
but it throws the below exception:
java.io.IOException: No input paths specified in job
Is there any way to read all the rows into a DataFrame as we do with JSON files using a wildcard?
val df = sqlContext.read.json("file:///home/*.json")
I think what you're asking for is to read the Hive table as usual and transform the JSON column using from_json function.
from_json(e: Column, schema: StructType): Column Parses a column containing a JSON string into a StructType with the specified schema. Returns null, in the case of an unparseable string.
Given you use sqlContext in your code, I'm afraid that you use Spark < 2.1.0 which then does not offer from_json (which was added in 2.1.0).
The solution then is to use a custom user-defined function (UDF) to do the parsing yourself.
val df = sqlContext.read.json("select Json_String from json_table")
The above won't work since the json operator expects a path or paths to JSON files on disk, not the result of executing a query against a Hive table.
json(paths: String*): DataFrame Loads a JSON file (JSON Lines text format or newline-delimited JSON) and returns the result as a DataFrame.
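A rough sketch of both variants, with the table and column names taken from the question and the JSON schema assumed purely for illustration:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// The schema of the JSON documents is assumed here; adjust it to your data.
val jsonSchema = new StructType()
  .add("id", IntegerType)
  .add("name", StringType)

// Spark >= 2.1.0: read the Hive table as usual and parse the column.
import spark.implicits._
val parsed = spark.table("json_table")
  .withColumn("parsed", from_json($"Json_String", jsonSchema))

// Spark < 2.1.0: write a UDF that parses the string with the JSON library of
// your choice and returns the fields you need, then apply it the same way.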

Unable to work with Parquet data having columns with forward slash in Spark SQL

I have a Parquet file and I am able to load it in Spark SQL. But the Parquet file has lots of columns with forward slashes, which causes a problem when I try to get data from the table using those columns.
e.g. column name: abc/def/efg/hij
parqfile.registerTempTable("parquetTable")
val result=sqlContext.sql("select abc/def/efg/hij from parquetTable")
throwing below error.
org.apache.spark.sql.AnalysisException: cannot resolve 'abc' given input columns
The slash is a reserved character; you'll need to quote the column name in your SELECT using backticks, as follows:
val result = sqlContext.sql("select `abc/def/efg/hij` from parquetTable")

Edit csv file in Scala

I would like to edit a CSV file (more than 500MB).
If I have data like
ID, NUMBER
A, 1
B, 3
C, 4
D, 5
I want to add some extra column like
ID, NUMBER, DIFF
A, 1, 0
B, 3, 2
C, 4, 1
D, 5, 1
This data should also be available as a Scala data type.
(in) original CSV file -> (out) (new CSV file, file data (RDD type?))
Q1. Which is the best way to treat the data?
Make a new CSV file from the original CSV file, and then re-open the new CSV file as Scala data, or
make new Scala data first and then write it out as a CSV file?
Q2. Do I need to use a DataFrame for this? Which library or API should I use?
A fairly trivial way to achieve that is to use kantan.csv:
import java.io.File
import kantan.csv.ops._
import kantan.csv.generic.codecs._

case class Input(id: String, number: Int)
case class Output(id: String, number: Int, diff: Int)

// Read rows lazily, computing each row's diff from the previous number (0 for the first row).
var previous: Option[Int] = None
val data = new File("input.csv").asUnsafeCsvReader[Input](',', true).map { i =>
  val diff = previous.fold(0)(p => i.number - p)
  previous = Some(i.number)
  Output(i.id, i.number, diff)
}

new File("output.csv").writeCsv[Output](data.toIterator, ',', List("ID", "NUMBER", "DIFF"))
This code will work regardless of the data size, since at no point do we load the entire dataset (or, indeed, more than one row) in memory.
Note that in my example code, data comes from and goes to File instances, but it could come from anything that can be turned into a Reader instance - a URI, a String...
RDD vs DataFrame: both are good options. The recommendation is to use DataFrames which allows some extra optimizations behind the scenes, but for simple enough tasks the performance is probably similar. Another advantage of using DataFrames is the ability to use SQL - if you're comfortable with SQL you can just load the file, register it as temp table and query it to perform any transformation. A more relevant advantage of DataFrames is the ability to use databricks' spark-csv library to easily read and write CSV files.
Let's assume you will use DataFrames (DF) for now:
Flow: it sounds like you should:
Load the original file into a DF, call it input.
Transform it into the new DF, called withDiff.
At this point, it would make sense to cache the result; let's call the cached DF result.
Now you can save result to the new CSV file.
Use result again for whatever else you need (see the sketch below).
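A rough sketch of that flow (not from the original answer), assuming Spark 2.x's built-in CSV reader (on Spark 1.x, databricks' spark-csv format gives the same API) and header columns named ID and NUMBER:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val input = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("input.csv")

// DIFF = NUMBER minus the previous row's NUMBER, 0 for the first row;
// ordering by ID is assumed here just to define what "previous" means.
val w = Window.orderBy("ID")
val withDiff = input.withColumn(
  "DIFF", coalesce($"NUMBER" - lag($"NUMBER", 1).over(w), lit(0)))

val result = withDiff.cache()
result.write.option("header", "true").csv("output")  // writes a directory of CSV part files
// ...and result stays cached for any further queries you need.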