SparkSQL - Read parquet file directly - scala

I am migrating from Impala to SparkSQL, using the following code to read a table:
my_data = sqlContext.read.parquet('hdfs://my_hdfs_path/my_db.db/my_table')
How do I invoke SparkSQL above, so it can return something like:
'select col_A, col_B from my_table'

After creating a Dataframe from parquet file, you have to register it as a temp table to run sql queries on it.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.parquet("src/main/resources/peopleTwo.parquet")
df.printSchema
// after registering as a table you will be able to run sql queries
df.registerTempTable("people")
sqlContext.sql("select * from people").collect.foreach(println)

With plain SQL
JSON, ORC, Parquet, and CSV files can be queried without creating the table on Spark DataFrame.
//This Spark 2.x code you can do the same on sqlContext as well
val spark: SparkSession = SparkSession.builder.master("set_the_master").getOrCreate
spark.sql("select col_A, col_B from parquet.`hdfs://my_hdfs_path/my_db.db/my_table`")
.show()

Suppose that you have the parquet file ventas4 in HDFS:
hdfs://localhost:9000/sistgestion/sql/ventas4
In this case, the steps are:
Charge the SQL Context:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Read the parquet File:
val ventas=sqlContext.read.parquet("hdfs://localhost:9000/sistgestion/sql/ventas4")
Register a temporal table:
ventas.registerTempTable("ventas")
Execute the query (in this line you can use toJSON to pass a JSON format or you can use collect()):
sqlContext.sql("select * from ventas").toJSON.foreach(println(_))
sqlContext.sql("select * from ventas").collect().foreach(println(_))

Use the following code in intellij:
def groupPlaylistIds(): Unit ={
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder.appName("FollowCount")
.master("local[*]")
.getOrCreate()
val sc = spark.sqlContext
val d = sc.read.format("parquet").load("/Users/CCC/Downloads/pq/file1.parquet")
d.printSchema()
val d1 = d.select("col1").filter(x => x!='-')
val d2 = d1.filter(col("col1").startsWith("searchcriteria"));
d2.groupBy("col1").count().sort(col("count").desc).show(100, false)
}

Related

Create DataFrame / Dataset using Header and Data in two different directories

I am getting the input file as CSV. Here I get two directories, first directory will have one file with header record and second directory will have data files. Here, I want to create a Dataframe/Dataset.
One way I can do is creating case class and split the data files by delimiter and attached the schema and create dataFrame.
What I am looking is read Header file and data file and create dataFrame. I saw a solution using databricks but my organization has restriction to use the databricks and below is the code which I come across. Can one you help me the solution without using databricks.
val headersDF = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("path to headers.csv")
val schema = headersDF.schema
val dataDF = sqlContext
.read
.format("com.databricks.spark.csv")
.schema(schema)
.load("path to data.csv")
You can do it like this
val schema=spark
.read
.format("csv")
.option("header","true")
.option("delimiter",",")
.load("C:\\spark\\programs\\empheaders.csv")
.schema
val data=spark
.read
.format("csv")
.schema(schema)
.option("delimiter",",")
.load("C:\\spark\\programs\\empdata.csv")
Because in your header CSV file you don't have any data there is no point in inferring the schema out of it.
So just get the field names by reading it.
val headerRDD = sc.parallelize(Seq(("Name,Age,Sal"))) //Assume this line is in your Header CSV
val header = headerRDD.flatMap(_.split(",")).collect
//headerRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at parallelize at command-2903591155643047:1
//header: Array[String] = Array(Name, Age, Sal)
Then read the data CSV file.
Either map each line to a case class or a tuple. Convert the data to a DataFrame by passing the header array.
val dataRdd = sc.parallelize(Seq(("Tom,22,500000"),("Rick,40,1000000"))) //Assume these lines are in your data CSV file
val data = dataRdd.map(_.split(",")).map(x => (x(0),x(1).toInt,x(2).toDouble)).toDF(header: _*)
//dataRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[72] at parallelize at command-2903591155643048:1
//data: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 1 more field]
Result:
data.show()
+----+---+---------+
|Name|Age| Sal|
+----+---+---------+
| Tom| 22| 500000.0|
|Rick| 40|1000000.0|
+----+---+---------+

How to union multiple csv files in to single csv file

I am writing the below code to convert the union of multiple CSV files and writing the combined data into new file. But I am facing an error.
val filesData=List("file1", "file2")
val dataframes = filesData.map(spark.read.option("header", true).csv(_))
val combined = dataframes.reduce(_ union _)
val data = combined.rdd
val head :Array[String]= data.first()
val memberDataRDD = data.filter(_(0) != head(0))
type mismatch; found : org.apache.spark.sql.Row required: Array[String]
there will not be any issue as long as both csv df have same schema
val df = spark.read.option("header", "true").csv("C:\maheswara\learning\big data\spark\sample_data\tmp")
val df1 = spark.read.option("header", "true").csv("C:\maheswara\learning\big data\spark\sample_data\tmp1")
val dfs = List(df, df1)
val dfUnion = dfs.reduce(_ union _)
You can just read multiple paths directly with Spark:
spark.read.option("header", true).csv(filesData:_*)

Spark - UnsupportedOperationException: collect_list is not supported in a window operation

I am using Spark 1.6. I have a dataframe generated from a parquet file with 6 columns. I am trying to group (partitionBy) and order(orderBy) the rows in the dataframe, to later collect those columns in an Array.
I wasn't sure if this actions were possible in Spark 1.6, but in the following answers they show how it can be done:
https://stackoverflow.com/a/35529093/1773841 #zero323
https://stackoverflow.com/a/45135012/1773841 #Ramesh Maharjan
Based on those answers I wrote the following code:
val sqlContext: SQLContext = new HiveContext(sc)
val conf = sc.hadoopConfiguration
val dataPath = "/user/today/*/*"
val dfSource : DataFrame = sqlContext.read.format("parquet").option("dateFormat", "DDMONYY").option("timeFormat", "HH24:MI:SS").load(dataPath)
val w = Window.partitionBy("code").orderBy("date".desc)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val dfCollec = dfData.withColumn("collected", collect_list(struct("col1","col2","col3","col4","col5","col6")).over(w))
So, I followed the pattern written by Ramesh, and I created the sqlContext based on Hive as Zero recommended. But I am still getting the following error:
java.lang.UnsupportedOperationException:
'collect_list(struct('col1,'col2,'col3,'col4,'col5,'col6)) is not
supported in a window operation.
at org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:191)
at org.apache.spark.sql.Column.over(Column.scala:1052)
What am I missing still?

Scala & Spark: Querying native SQL without a registered tempview

Is there a way to execute an SQL statement (including SELECT, FROM, WITH and different types of JOINs) without the need for registering a temp view priorly in Spark using Scala? The goal is to get a DataFrame without any detours from SQL code.
An example how it works (using an existing DataFrame and registering a tempview) is provided by the documentation:
// df is an existing DataFrame
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
The problem with an existing DataFrame is, that only sub quantities of the underlying DataFrame, used for generating the tempview, can be selected. This becomes very impractical if the SQL statements use data from many different tables or Views. Something like
// SQL is directly executed on database
val dfView = spark.sql(connectionProperties,
"SELECT *
FROM DATABASE_USER.V_VIEW_IN_DATABASE v1
JOIN DATABASE_USER.V_VIEW2_IN_DATABASE v2
ON v1.key = v2.key")
dfView.show()
with automatic type inference would solve my problem. I'm chasing one potential path pointed out in this question.
Setup: Hadoop v.2.7.3, Spark 2.0.0, Intelli J IDEA 2016.2, Scala 2.11.8, Testcluster on Win7 Workstation, Oracle 12c database
Try:
import org.apache.spark.ml.feature.SQLTransformer
val df = spark.read.format("jdbc")
.options(Map("url" -> "jdbc:postgresql://<server>:<port>/<db>?user=<user>&password=<password>", "dbtable" -> "<dbtable>"))
.load()
val ans = new SQLTransformer()
.setStatement("SELECT value + id AS points FROM __THIS__")
.transform(df)
ans.show()
+------+
|points|
+------+
| 101.0|
| 202.0|
+------+
If you want some sugar coating:
object ImprovedDataFrameContext {
import org.apache.spark.ml.feature.SQLTransformer
implicit class ImprovedDataFrame(df: org.apache.spark.sql.DataFrame) {
def T(query: String): org.apache.spark.sql.DataFrame = {
new SQLTransformer().setStatement(query).transform(df)
}
}
}
import ImprovedDataFrameContext._
val df = spark.read.format("jdbc")
.options(Map("url" -> "jdbc:postgresql://<server>:<port>/<db>?user=<user>&password=<password>", "dbtable" -> "<dbtable>"))
.load()
.T("<sql_query>")
.show()
+------+
|points|
+------+
| 101.0|
| 202.0|
+------+

PySpark - read recursive Hive table

I have a Hive table that has multiple sub-directories in HDFS, something like:
/hdfs_dir/my_table_dir/my_table_sub_dir1
/hdfs_dir/my_table_dir/my_table_sub_dir2
...
Normally I set the following parameters before I run a Hive script:
set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
select * from my_db.my_table;
I'm trying to do the same using PySpark,
conf = (SparkConf().setAppName("My App")
...
.set("hive.input.dir.recursive", "true")
.set("hive.mapred.supports.subdirectories", "true")
.set("hive.supports.subdirectories", "true")
.set("mapred.input.dir.recursive", "true"))
sc = SparkContext(conf = conf)
sqlContext = HiveContext(sc)
my_table = sqlContext.sql("select * from my_db.my_table")
and end up with an error like:
java.io.IOException: Not a file: hdfs://hdfs_dir/my_table_dir/my_table_sub_dir1
What's the correct way to read a Hive table with sub-directories in Spark?
What I have found is that these values must be preceded with spark as in:
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
Try setting them through ctx.sql() prior to execute the query:
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
my_table = sqlContext.sql("select * from my_db.my_table")
Try setting them through SpakSession to execute the query:
sparkSession = (SparkSession
.builder
.appName('USS - Unified Scheme of Sells')
.config("hive.metastore.uris", "thrift://probighhwm001:9083", conf=SparkConf())
.config("hive.input.dir.recursive", "true")
.config("hive.mapred.supports.subdirectories", "true")
.config("hive.supports.subdirectories", "true")
.config("mapred.input.dir.recursive", "true")
.enableHiveSupport()
.getOrCreate()
)