Scala & Spark: Querying native SQL without a registered tempview - scala

Is there a way to execute an SQL statement (including SELECT, FROM, WITH and different types of JOINs) without the need for registering a temp view priorly in Spark using Scala? The goal is to get a DataFrame without any detours from SQL code.
An example how it works (using an existing DataFrame and registering a tempview) is provided by the documentation:
// df is an existing DataFrame
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
The problem with an existing DataFrame is, that only sub quantities of the underlying DataFrame, used for generating the tempview, can be selected. This becomes very impractical if the SQL statements use data from many different tables or Views. Something like
// SQL is directly executed on database
val dfView = spark.sql(connectionProperties,
"SELECT *
FROM DATABASE_USER.V_VIEW_IN_DATABASE v1
JOIN DATABASE_USER.V_VIEW2_IN_DATABASE v2
ON v1.key = v2.key")
dfView.show()
with automatic type inference would solve my problem. I'm chasing one potential path pointed out in this question.
Setup: Hadoop v.2.7.3, Spark 2.0.0, Intelli J IDEA 2016.2, Scala 2.11.8, Testcluster on Win7 Workstation, Oracle 12c database

Try:
import org.apache.spark.ml.feature.SQLTransformer
val df = spark.read.format("jdbc")
.options(Map("url" -> "jdbc:postgresql://<server>:<port>/<db>?user=<user>&password=<password>", "dbtable" -> "<dbtable>"))
.load()
val ans = new SQLTransformer()
.setStatement("SELECT value + id AS points FROM __THIS__")
.transform(df)
ans.show()
+------+
|points|
+------+
| 101.0|
| 202.0|
+------+
If you want some sugar coating:
object ImprovedDataFrameContext {
import org.apache.spark.ml.feature.SQLTransformer
implicit class ImprovedDataFrame(df: org.apache.spark.sql.DataFrame) {
def T(query: String): org.apache.spark.sql.DataFrame = {
new SQLTransformer().setStatement(query).transform(df)
}
}
}
import ImprovedDataFrameContext._
val df = spark.read.format("jdbc")
.options(Map("url" -> "jdbc:postgresql://<server>:<port>/<db>?user=<user>&password=<password>", "dbtable" -> "<dbtable>"))
.load()
.T("<sql_query>")
.show()
+------+
|points|
+------+
| 101.0|
| 202.0|
+------+

Related

Script result in sqlContext on registered temp table in Scala has minor difference than using Reduce in RDD

I am learning Scala now, and notice something that I don't understand why, I have a result to be generated registered Temp Table via sqlContext on a DataFrame derived from a RDD, the RDD is from a hdfs file exported from mysql.
Raw hdfs data can be retrieved from here (2MB):
https://github.com/mdivk/175Scala/tree/master/data/part-00000
Here is the script used on mysql:
select avg(order_item_subtotal) from order_items;
+--------------------------+
| avg(order_item_subtotal) |
+--------------------------+
| 199.32066922046081 |
+--------------------------+
On Scala:
sqlContext:
> scala> val res = sqlContext.sql("select avg(order_item_subtotal) from
> order_items")
> +------------------+
| _c0|
+------------------+
|199.32066922046081|
+------------------+
So they are the same, exactly same, which is expected;
RDD (please use the data file from https://github.com/mdivk/175Scala/tree/master/data/part-00000):
val orderItems = sc.textFile("/public/retail_db/order_items")
val orderItemsRevenue = orderItems.map(oi => oi.split(",")(4).toFloat)
val totalRev = orderItemsRevenue.reduce((total, revenue) => total + revenue)
res4: Float = 3.4326256E7
val cnt = orderItemsRevenue.count
val avgRev = totalRev/cnt
avgRev: Float = 199.34178
As you can see, the avgRev is 199.34178, not what we calculated above in mysql and sqlContext 199.32066922046081
I do not think this is an acceptable discrepancy but I could be wrong, am I missing anything here?
It would be appreciated if you can help me understand this. Thank you.
Maybe your mysql table is using double thus you could try to change:
val orderItemsRevenue = orderItems.map(oi => oi.split(",")(4).toFloat)
into
val orderItemsRevenue = orderItems.map(oi => oi.split(",")(4).toDouble)
since you need more precise results you can use double instead of float. Float is 32bits and double 64 which makes it useful when memory usage is critical otherwise use almost always double.

Scala Spark: Performance issue renaming huge number of columns

To be able to work with columnnames of my DataFrame without escaping the . I need a function to "validify" all columnnames - but none of the methods I tried does the job in a timely manner (I'm aborting after 5 minutes).
The dataset I'm trying my algorithms on is the golub Dataset (get it here). It's a 2.2MB CSV file with 7200 columns. Renaming all columns should be a matter of seconds
Code to read the CSV in
var dfGolub = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv("golub_merged.csv")
.drop("_c0") // drop the first column
.repartition(numOfCores)
Attempts to rename columns:
def validifyColumnnames1(df : DataFrame) : DataFrame = {
import org.apache.spark.sql.functions.col
val cols = df.columns
val colsRenamed = cols.map(name => col(name).as(name.replaceAll("\\.","")))
df.select(colsRenamed : _*)
}
def validifyColumnnames2[T](df : Dataset[T]) : DataFrame = {
val newColumnNames = ArrayBuffer[String]()
for(oldCol <- df.columns) {
newColumnNames += oldCol.replaceAll("\\.","")
}
df.toDF(newColumnNames : _*)
}
def validifyColumnnames3(df : DataFrame) : DataFrame = {
var newDf = df
for(col <- df.columns){
newDf = newDf.withColumnRenamed(col,col.replaceAll("\\.",""))
}
newDf
}
Any ideas what is causing this performance issue?
Setup: I'm running Spark 2.1.0 on Ubuntu 16.04 in local[24] mode on a machine with 16cores * 2 threads and 96GB of RAM
Assuming you know the types you can simply create the schema instead of infering it (infering the schema costs performance and might even be wrong for csv).
Lets assume for simplicity you have the file example.csv as follows:
A.B, A.C, A.D
a,3,1
You can do something like this:
val scehma = StructType(Seq(StructField("A_B",StringType),StructField("A_C", IntegerType), StructField("AD", IntegerType)))
val df = spark.read.option("header","true").schema(scehma).csv("example.csv")
df.show()
+---+---+---+
|A_B|A_C| AD|
+---+---+---+
| a| 3| 1|
+---+---+---+
IF you do not know the info in advance you can use infer schema as you did before, then you can use the dataframe to generate the schema:
val fields = for {
x <- df.schema
} yield StructField(x.name.replaceAll("\\.",""), x.dataType, x.nullable)
val schema = StructType(fields)
and the reread the dataframe using that schema as before

SparkSQL - Read parquet file directly

I am migrating from Impala to SparkSQL, using the following code to read a table:
my_data = sqlContext.read.parquet('hdfs://my_hdfs_path/my_db.db/my_table')
How do I invoke SparkSQL above, so it can return something like:
'select col_A, col_B from my_table'
After creating a Dataframe from parquet file, you have to register it as a temp table to run sql queries on it.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.parquet("src/main/resources/peopleTwo.parquet")
df.printSchema
// after registering as a table you will be able to run sql queries
df.registerTempTable("people")
sqlContext.sql("select * from people").collect.foreach(println)
With plain SQL
JSON, ORC, Parquet, and CSV files can be queried without creating the table on Spark DataFrame.
//This Spark 2.x code you can do the same on sqlContext as well
val spark: SparkSession = SparkSession.builder.master("set_the_master").getOrCreate
spark.sql("select col_A, col_B from parquet.`hdfs://my_hdfs_path/my_db.db/my_table`")
.show()
Suppose that you have the parquet file ventas4 in HDFS:
hdfs://localhost:9000/sistgestion/sql/ventas4
In this case, the steps are:
Charge the SQL Context:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Read the parquet File:
val ventas=sqlContext.read.parquet("hdfs://localhost:9000/sistgestion/sql/ventas4")
Register a temporal table:
ventas.registerTempTable("ventas")
Execute the query (in this line you can use toJSON to pass a JSON format or you can use collect()):
sqlContext.sql("select * from ventas").toJSON.foreach(println(_))
sqlContext.sql("select * from ventas").collect().foreach(println(_))
Use the following code in intellij:
def groupPlaylistIds(): Unit ={
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder.appName("FollowCount")
.master("local[*]")
.getOrCreate()
val sc = spark.sqlContext
val d = sc.read.format("parquet").load("/Users/CCC/Downloads/pq/file1.parquet")
d.printSchema()
val d1 = d.select("col1").filter(x => x!='-')
val d2 = d1.filter(col("col1").startsWith("searchcriteria"));
d2.groupBy("col1").count().sort(col("count").desc).show(100, false)
}

Modifying a column after groupBy in SPARK (using SCALA) [duplicate]

I have a table of two string type columns (username, friend) and for each username, I want to collect all of its friends on one row, concatenated as strings. For example: ('username1', 'friends1, friends2, friends3')
I know MySQL does this with GROUP_CONCAT. Is there any way to do this with Spark SQL?
Before you proceed: This operations is yet another another groupByKey. While it has multiple legitimate applications it is relatively expensive so be sure to use it only when required.
Not exactly concise or efficient solution but you can use UserDefinedAggregateFunction introduced in Spark 1.5.0:
object GroupConcat extends UserDefinedAggregateFunction {
def inputSchema = new StructType().add("x", StringType)
def bufferSchema = new StructType().add("buff", ArrayType(StringType))
def dataType = StringType
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, ArrayBuffer.empty[String])
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0))
buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
}
def evaluate(buffer: Row) = UTF8String.fromString(
buffer.getSeq[String](0).mkString(","))
}
Example usage:
val df = sc.parallelize(Seq(
("username1", "friend1"),
("username1", "friend2"),
("username2", "friend1"),
("username2", "friend3")
)).toDF("username", "friend")
df.groupBy($"username").agg(GroupConcat($"friend")).show
## +---------+---------------+
## | username| friends|
## +---------+---------------+
## |username1|friend1,friend2|
## |username2|friend1,friend3|
## +---------+---------------+
You can also create a Python wrapper as shown in Spark: How to map Python with Scala or Java User Defined Functions?
In practice it can be faster to extract RDD, groupByKey, mkString and rebuild DataFrame.
You can get a similar effect by combining collect_list function (Spark >= 1.6.0) with concat_ws:
import org.apache.spark.sql.functions.{collect_list, udf, lit}
df.groupBy($"username")
.agg(concat_ws(",", collect_list($"friend")).alias("friends"))
You can try the collect_list function
sqlContext.sql("select A, collect_list(B), collect_list(C) from Table1 group by A
Or you can regieter a UDF something like
sqlContext.udf.register("myzip",(a:Long,b:Long)=>(a+","+b))
and you can use this function in the query
sqlConttext.sql("select A,collect_list(myzip(B,C)) from tbl group by A")
In Spark 2.4+ this has become simpler with the help of collect_list() and array_join().
Here's a demonstration in PySpark, though the code should be very similar for Scala too:
from pyspark.sql.functions import array_join, collect_list
friends = spark.createDataFrame(
[
('jacques', 'nicolas'),
('jacques', 'georges'),
('jacques', 'francois'),
('bob', 'amelie'),
('bob', 'zoe'),
],
schema=['username', 'friend'],
)
(
friends
.orderBy('friend', ascending=False)
.groupBy('username')
.agg(
array_join(
collect_list('friend'),
delimiter=', ',
).alias('friends')
)
.show(truncate=False)
)
In Spark SQL the solution is likewise:
SELECT
username,
array_join(collect_list(friend), ', ') AS friends
FROM friends
GROUP BY username;
The output:
+--------+--------------------------+
|username|friends |
+--------+--------------------------+
|jacques |nicolas, georges, francois|
|bob |zoe, amelie |
+--------+--------------------------+
This is similar to MySQL's GROUP_CONCAT() and Redshift's LISTAGG().
Here is a function you can use in PySpark:
import pyspark.sql.functions as F
def group_concat(col, distinct=False, sep=','):
if distinct:
collect = F.collect_set(col.cast(StringType()))
else:
collect = F.collect_list(col.cast(StringType()))
return F.concat_ws(sep, collect)
table.groupby('username').agg(F.group_concat('friends').alias('friends'))
In SQL:
select username, concat_ws(',', collect_list(friends)) as friends
from table
group by username
-- the spark SQL resolution with collect_set
SELECT id, concat_ws(', ', sort_array( collect_set(colors))) as csv_colors
FROM (
VALUES ('A', 'green'),('A','yellow'),('B', 'blue'),('B','green')
) as T (id, colors)
GROUP BY id
One way to do it with pyspark < 1.6, which unfortunately doesn't support user-defined aggregate function:
byUsername = df.rdd.reduceByKey(lambda x, y: x + ", " + y)
and if you want to make it a dataframe again:
sqlContext.createDataFrame(byUsername, ["username", "friends"])
As of 1.6, you can use collect_list and then join the created list:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
join_ = F.udf(lambda x: ", ".join(x), StringType())
df.groupBy("username").agg(join_(F.collect_list("friend").alias("friends"))
Language: Scala
Spark version: 1.5.2
I had the same issue and also tried to resolve it using udfs but, unfortunately, this has led to more problems later in the code due to type inconsistencies. I was able to work my way around this by first converting the DF to an RDD then grouping by and manipulating the data in the desired way and then converting the RDD back to a DF as follows:
val df = sc
.parallelize(Seq(
("username1", "friend1"),
("username1", "friend2"),
("username2", "friend1"),
("username2", "friend3")))
.toDF("username", "friend")
+---------+-------+
| username| friend|
+---------+-------+
|username1|friend1|
|username1|friend2|
|username2|friend1|
|username2|friend3|
+---------+-------+
val dfGRPD = df.map(Row => (Row(0), Row(1)))
.groupByKey()
.map{ case(username:String, groupOfFriends:Iterable[String]) => (username, groupOfFriends.mkString(","))}
.toDF("username", "groupOfFriends")
+---------+---------------+
| username| groupOfFriends|
+---------+---------------+
|username1|friend2,friend1|
|username2|friend3,friend1|
+---------+---------------+
Below python-based code that achieves group_concat functionality.
Input Data:
Cust_No,Cust_Cars
1, Toyota
2, BMW
1, Audi
2, Hyundai
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
import pyspark.sql.functions as F
spark = SparkSession.builder.master('yarn').getOrCreate()
# Udf to join all list elements with "|"
def combine_cars(car_list,sep='|'):
collect = sep.join(car_list)
return collect
test_udf = udf(combine_cars,StringType())
car_list_per_customer.groupBy("Cust_No").agg(F.collect_list("Cust_Cars").alias("car_list")).select("Cust_No",test_udf("car_list").alias("Final_List")).show(20,False)
Output Data:
Cust_No, Final_List
1, Toyota|Audi
2, BMW|Hyundai
You can also use Spark SQL function collect_list and after you will need to cast to string and use the function regexp_replace to replace the special characters.
regexp_replace(regexp_replace(regexp_replace(cast(collect_list((column)) as string), ' ', ''), ',', '|'), '[^A-Z0-9|]', '')
it's an easier way.
Higher order function concat_ws() and collect_list() can be a good alternative along with groupBy()
import pyspark.sql.functions as F
df_grp = df.groupby("agg_col").agg(F.concat_ws("#;", F.collect_list(df.time)).alias("time"), F.concat_ws("#;", F.collect_list(df.status)).alias("status"), F.concat_ws("#;", F.collect_list(df.llamaType)).alias("llamaType"))
Sample Output
+-------+------------------+----------------+---------------------+
|agg_col|time |status |llamaType |
+-------+------------------+----------------+---------------------+
|1 |5-1-2020#;6-2-2020|Running#;Sitting|red llama#;blue llama|
+-------+------------------+----------------+---------------------+

Scala spark Select as not working as expected

Hope someone can help. Fairly certain this is something I'm doing wrong.
I have a dataframe called uuidvar with 1 column called 'uuid' and another dataframe, df1, with a number of columns, one of which is also 'uuid'. I would like to select from from df1 all of the rows which have a uuid that appear in uuidvar. Now, having the same column names is not ideal so I tried to do it with
val uuidselection=df1.join(uuidvar, df1("uuid") === uuidvar("uuid").as("another_uuid"), "right_outer").select("*")
However when I show uuidselection I have 2 columns called "uuid". Furthermore, if I try and select the specific columns I want, I am told
cannot resolve 'uuidvar' given input columns
or similar depending on what I try and select.
I have tried to make it simpler and just do
val uuidvar2=uuidvar.select("uuid").as("uuidvar")
and this doesn't rename the column in uuidvar.
Does 'as' not operate as I am expecting it to, am I making some other fundamental error or is it broken?
I'm using spark 1.5.1 and scala 1.10.
Answer
You can't use as when specifying the join-criterion.
Use withColumnRenamed to modify the column before the join.
Seccnd, use generic col function for accessing columns via name (instead of using the dataframe's apply method, e.g. df1(<columnname>)
case class UUID1 (uuid: String)
case class UUID2 (uuid: String, b:Int)
class UnsortedTestSuite2 extends SparkFunSuite {
configuredUnitTest("SO - uuid") { sc =>
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val uuidvar = sc.parallelize( Seq(
UUID1("cafe-babe-001"),
UUID1("cafe-babe-002"),
UUID1("cafe-babe-003"),
UUID1("cafe-babe-004")
)).toDF()
val df1 = sc.parallelize( Seq(
UUID2("cafe-babe-001", 1),
UUID2("cafe-babe-002", 2),
UUID2("cafe-babe-003", 3)
)).toDF()
val uuidselection=df1.join(uuidvar.withColumnRenamed("uuid", "another_uuid"), col("uuid") === col("another_uuid"), "right_outer")
uuidselection.show()
}
}
delivers
+-------------+----+-------------+
| uuid| b| another_uuid|
+-------------+----+-------------+
|cafe-babe-001| 1|cafe-babe-001|
|cafe-babe-002| 2|cafe-babe-002|
|cafe-babe-003| 3|cafe-babe-003|
| null|null|cafe-babe-004|
+-------------+----+-------------+
Comment
.select("*") does not have any effect. So
df.select("*") =^= df
I've always used the withColumnRenamed api to rename columns:
Take this table as an example:
| Name | Age |
df.withColumnRenamed('Age', 'newAge').show()
| Name | newAge |
So to make it work with your code, something like this should work:
val uuidvar_another = uuidvar.withColumnRenamed("uuid", "another_uuid")
val uuidselection=df1.join(uuidvar, df1("uuid") === uuidvar("another_uuid"), "right_outer").select("*")