I am new to Spark SQL. I have a Spark SQL query running inside a for loop. Example:
val sQuery = "select distinct col1, col2, col3 from HiveDB.HiveTableName"
The query is executed sequentially for different Hive databases and tables in the loop, almost 200 of them. There is a performance hit because the query has to scan each table in full. I tried to rewrite/optimize the query like this:
val sQuery = "select * from (select col1, col2, col3, dense_rank() over (order by `date` desc) as rnk from HiveDB.HiveTableName ) b where b.rnk =1"
But it still takes the same time as the first query. Can anyone suggest how to optimize the query?
Spark version used: 2.3.2.
The tables are external, in ORC format, and partitioned by yyyy-mm.
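One thing worth noting since the tables are partitioned by yyyy-mm: the dense_rank rewrite still reads every row, because the window function has to rank the whole table before the rnk = 1 filter applies. A filter on the partition column would let Spark prune partitions instead, which matches the intent of the rnk = 1 rewrite (keeping only the newest data). A minimal sketch, assuming a hypothetical partition column named part_col holding yyyy-MM strings:
// Sketch only: `part_col` is a stand-in for the real yyyy-mm partition
// column. Filtering on it lets Spark prune all but the newest partition
// instead of scanning the entire table.
val latestPart = spark
  .sql("SHOW PARTITIONS HiveDB.HiveTableName")
  .collect()
  .map(_.getString(0).split("=")(1)) // "part_col=2019-08" -> "2019-08"
  .max                               // zero-padded yyyy-MM sorts correctly

val sQuery = s"select distinct col1, col2, col3 " +
  s"from HiveDB.HiveTableName where part_col = '$latestPart'"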
Related
For the following operation, which runs a SQL statement in Spark SQL to join two tables in PostgreSQL:
val df = spark.read.jdbc(url, "select * from table_1 join table_2 on a where x", connproperties)
Will the database engine execute the join and send back the joined results? Or will the database send all records of table_1 and table_2 to the Spark job, and the Spark job do the join? Is there any documentation explaining this behavior? Thanks!
The PostgreSQL database will only return a single result set from a single query. If you use valid SQL, that could be the joined result, or nothing, in case no records match your conditions.
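To make PostgreSQL itself perform the join, the usual technique is to pass a parenthesized, aliased subquery in place of a table name; Spark sends that subquery to the database and receives only its result set. A hedged sketch (connection details and join condition are placeholders, since the ones in the question are schematic):
import java.util.Properties

// PostgreSQL executes this subquery and ships back only the joined rows;
// Spark never sees the raw tables.
val pushedDown =
  """(select *
    |   from table_1 t1
    |   join table_2 t2 on t1.a = t2.a
    |  where t1.x = 1) as joined""".stripMargin

val url = "jdbc:postgresql://host:5432/db" // placeholder
val connproperties = new Properties()      // user/password as in the question

val df = spark.read.jdbc(url, pushedDown, connproperties)
If you instead load the two tables as separate DataFrames and join them in Spark, the join runs in Spark, though simple filters and column pruning are still pushed down to the database.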
I am submitting 3 million records to a Postgres table, table1, from a staging table, table2. My update and insert queries are as below:
UPDATE table1 t set
col1 = stage.col1,
col2 = stage.col2 ,
col3 = stage.col3
from table2 stage
where t.id::uuid = stage.id::uuid
and coalesce(t.name,'name') = coalesce(stage.name,'name')
and coalesce(t.level,'level') = coalesce(stage.level,'level');
INSERT INTO table1
(col1,
col2,
col3,
col4,
id,
name,
level)
select
stage.col1,
stage.col2,
stage.col3,
stage.col4,
stage.id,
stage.name,
stage.level
from table2 stage
where NOT EXISTS (select 1
from table1 t where
t.id::uuid = stage.id::uuid
and coalesce(t.name,'name') = coalesce(stage.name,'name')
and coalesce(t.level,'level') = coalesce(stage.level,'level'));
I am facing performance issues (it takes as long as 1.5 hours) even though I am using exactly the same indexed keys (btree) as defined on the table. To test the cause, I created a replica of table1 without indexes and was able to submit the entire data set in approximately 15-17 minutes, so I am inclined to think the indexes are killing performance on the table, as there are so many of them (including some unused indexes which I cannot drop due to permission issues). I am looking for suggestions to improve/optimize my query, or perhaps some other strategy to upsert the data, in order to reduce the load time. Any suggestion is appreciated.
Running EXPLAIN ANALYZE on the query helped me realize it was never using the indexes defined on the target table and was doing a sequential scan over a large number of rows. The cause was that one of the keys used in the update/insert was wrapped in a coalesce that the defined indexes did not include. Though it means I have to handle NULLs properly before feeding data into my code, it improved the performance significantly. I am open to further improvements.
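For what it's worth, the other way around also works: instead of cleaning NULLs up front, an expression index whose expressions match the COALESCE predicates lets the planner use the index directly. A hedged sketch over JDBC (connection string, credentials, and index name are all hypothetical):
import java.sql.DriverManager

// Hypothetical connection details; assumes the PostgreSQL JDBC driver is
// on the classpath. The indexed expressions must match the query's
// predicate expressions for the planner to consider the index.
val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "password")
val stmt = conn.createStatement()
stmt.execute(
  """create index if not exists table1_upsert_expr_idx
    |    on table1 ((id::uuid),
    |               (coalesce(name, 'name')),
    |               (coalesce(level, 'level')))""".stripMargin)
stmt.close()
conn.close()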
In PostgreSQL I have tried: count(distinct (col1,col2,col3,col4,col5))
In BigQuery: count(distinct concat(col1,col2,col3,col4,col5))
My scenario is that I need to get the same result in BigQuery as in PostgreSQL.
Though this approach works on 3 columns, I am not getting the same value as PostgreSQL for 5 columns.
sample query:
select col1,
count(distinct concat(col1,col2,col3,col4,col5))
from A
group by col1
When I remove distinct and concat, a simple count(col1,col2,col3,col4,col5) gives exactly the same value as PostgreSQL. But I need the distinct count over these columns. Is there any way to achieve this? And does BigQuery's concat work differently?
Below are a few options for BigQuery Standard SQL.
#standardSQL
SELECT col1,
COUNT(DISTINCT TO_JSON_STRING((col1,col2,col3,col4,col5)))
FROM A
GROUP BY col1
OR
#standardSQL
SELECT col1,
COUNT(DISTINCT FORMAT('%T', [col1,col2,col3,col4,col5]))
FROM A
GROUP BY col1
An alternative suitable for the many databases that don't support that form of COUNT DISTINCT:
SELECT COUNT(*)
FROM (
SELECT DISTINCT Origin, Dest, Reporting_Airline
FROM `fh-bigquery.flights.ontime_201908`
WHERE FlightDate_year = "2018-01-01"
)
My guess as to why CONCAT didn't work in your sample: do you have any NULL values in those columns?
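That NULL guess is easy to reproduce. A small sketch in Spark SQL, only because it is handy elsewhere on this page and its CONCAT propagates NULL the same way BigQuery's does: any NULL argument makes the whole result NULL, and COUNT(DISTINCT ...) then skips those rows, while serializing a struct keeps them countable.
// Toy data: two of the three rows contain a NULL. Assumes a SparkSession
// named `spark` is in scope.
import spark.implicits._
val df = Seq(("a", "x"), ("a", null), ("b", null)).toDF("col1", "col2")
df.createOrReplaceTempView("A")

// concat() is NULL for the two NULL-bearing rows and COUNT(DISTINCT)
// ignores NULLs -> prints 1
spark.sql("select count(distinct concat(col1, col2)) from A").show()

// to_json(struct(...)) plays the role of BigQuery's TO_JSON_STRING here:
// every row serializes to a non-NULL string -> prints 3
spark.sql("select count(distinct to_json(struct(col1, col2))) from A").show()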
I want to run the following query using Spark with Python (pulling data from Cassandra), similar to Oracle SQL:
select name, value, count(*) from table_name
group by name, ...
order by name, value
Any examples/samples would help, thanks!
Using Scala
val results = sqlContext.sql(
  "select name, value, count(*) from table_name group by name, value"
)
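If the Cassandra table is not already registered for SQL, here is a hedged sketch of the full flow with the DataStax spark-cassandra-connector (assumed to be on the classpath; keyspace and table names are placeholders), including the ORDER BY from the question:
// Load the Cassandra table as a DataFrame, register it, and query it.
val table = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "table_name"))
  .load()
table.registerTempTable("table_name")

val results = sqlContext.sql(
  """select name, value, count(*) as cnt
    |  from table_name
    | group by name, value
    | order by name, value""".stripMargin)
results.show()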
I was able to insert data into a Hive table from my Spark code using HiveContext, like below:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS e360_models.employee(id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
sqlContext.sql("insert into table e360_models.employee select t.* from (select 1210, 'rahul', 55) t")
sqlContext.sql("insert into table e360_models.employee select t.* from (select 1211, 'sriram pv', 35) t")
sqlContext.sql("insert into table e360_models.employee select t.* from (select 1212, 'gowri', 59) t")
val result = sqlContext.sql("FROM e360_models.employee SELECT id, name, age")
result.show()
But this approach creates a separate file in the warehouse for every insertion, like below:
part-00000
part-00000_copy_1
part-00000_copy_2
part-00000_copy_3
Is there any way to avoid this and just append the new data to a single file, or is there a better way to insert data into Hive from Spark?
No, there is no way to do that. Each new insert will create a new file. It's not a Spark "issue", but a general behavior you can experience with Hive too. The only way is to perform a single insert with the UNION of all your data, but if you need to do multiple inserts, you'll have multiple files.
The only other thing you can do is to enable file merging in Hive (see Hive Create Multi small files for each insert in HDFS and https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties).
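For completeness, a sketch of the single-insert-with-UNION approach mentioned above: one statement carrying all the rows is written in a single job, so you get one set of part files per batch instead of one file per insert.
// All rows go in through one INSERT, so Hive writes them together.
sqlContext.sql(
  """insert into table e360_models.employee
    |select 1210, 'rahul', 55
    |union all select 1211, 'sriram pv', 35
    |union all select 1212, 'gowri', 59""".stripMargin)
The DataFrame route works too: build one DataFrame with all the rows and call df.coalesce(1).write.insertInto("e360_models.employee") to get a single file for that batch.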