spark query execution time - scala

I have a single-node Hadoop setup with Hive installed locally, and some Hive tables stored in HDFS. I then configured Hive with a MySQL metastore. Now I have installed Spark and I am running some queries over the Hive tables like this (in Scala):
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hiveContext.sql("SELECT * FROM USERS")
result.show()
Do you know how to configure Spark to show the execution time of the query? By default it is not shown.

Use spark.time().
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hiveContext.sql("SELECT * FROM USERS")
spark.time(result.show())
https://db-blog.web.cern.ch/blog/luca-canali/2017-03-measuring-apache-spark-workload-metrics-performance-troubleshooting
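If you are on an older Spark build where spark.time is not available (it lives on the SparkSession, not on the HiveContext), a minimal sketch of measuring the same thing by hand, timing only the action that actually triggers execution:
val result = hiveContext.sql("SELECT * FROM USERS")
// Time only the action; the sql() call itself is lazy and does no work yet.
val start = System.nanoTime()
result.show()
val elapsedMs = (System.nanoTime() - start) / 1e6
println(s"Query executed in $elapsedMs ms")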

Related

PySpark dataframe to Hive table with partitions

I typically use the below code to write a PySpark data frame into a Hive table. I have a column pxn_dt which will be used to partition the table.
How can I modify the code below so that it will create partitions into the table (with the new month) the next time I run the script?
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.functions import *
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sqlContext = SQLContext(spark)
df.createOrReplaceTempView("mytempTable")
sqlContext.sql("create table my_db.table from mytempTable")
I'm trying to use the below line instead but it doesn't seem to work.
sqlContext.sql("create table my_db.table from mytempTable partitioned by(pxn_dt)")

Spark read from Entire Schema Scala

I want to create a Spark object that can read an entire schema instead of just one table from inside that schema. This is because I want to execute a particular query that joins multiple tables (I do not want to read from each table separately and manually recreate the query using Spark as the query is long and complicated). I was hoping it would work something like this:
val Schema_DF = spark.read
.format("jdbc")
.option("url", "jdbc://example.com")
.option("schema", "SCHEMA_NAME")
.option("user", "username")
.option("password", "pass")
.load()
I am able to use a different method to load the query I want as a ResultSet, but this seems long winded as I would then need to convert this to a Dataframe. Any help would be appreciated.
Cheers
You do not need to load the whole schema into Spark to do that.
You can run your query against the DB and get the result back as a DataFrame using the query option.
val jdbcDF = spark.read.format("jdbc")
.option("url", jdbcUrl)
.option("query", "select c1, c2 from t1")
.load()
Ref : https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Note: Spark pushes this query down to your database, i.e. the database processes the query and Spark just fetches the result. Just be careful if it's your live prod database :)
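If you are on a Spark version that predates the query option (it was added around Spark 2.4), the usual workaround is to wrap the statement in a derived table and pass it through dbtable. A sketch with placeholder table and column names:
// The long multi-table join goes in as a parenthesised subquery with an alias,
// which the JDBC source treats like a table name.
val joinQuery =
  """(SELECT u.id, u.name, o.total
    |FROM SCHEMA_NAME.users u
    |JOIN SCHEMA_NAME.orders o ON o.user_id = u.id) AS joined""".stripMargin

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)          // e.g. jdbc:postgresql://example.com:5432/db (placeholder)
  .option("dbtable", joinQuery)
  .option("user", "username")
  .option("password", "pass")
  .load()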

Pyspark - Looking to apply SQL queries to pyspark dataframes

Disclaimer: I'm very new to pyspark and this question might not be appropriate.
I've seen the following code online:
# Get the id, age where age = 22 in SQL
spark.sql("select id, age from swimmers where age = 22").show()
Now, I've tried to pivot using pyspark with the following code:
complete_dataset.createOrReplaceTempView("df")
temp = spark.sql("SELECT core_id from df")
This is the error I'm getting:
'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
I figured this would be straightforward but I can't seem to find the solution. Is this possible to do in pyspark?
NOTE: I am on an EMR Cluster using a Pyspark notebook.
In pyspark you can read a MySQL table (assuming that you are using MySQL) and create a dataframe.
jdbc_url = 'jdbc:mysql://{}:{}@{}/{}?zeroDateTimeBehavior=CONVERT_TO_NULL'.format(
    'username',
    'password',
    'host',
    'db',
)
# sql_ctx is your SQLContext (or SparkSession)
table_df = sql_ctx.read.jdbc(url=jdbc_url, table='table_name').select("column_name1", "column_name2")
Here table_df is the dataframe. Then you can do the required operations on the dataframe, like filter etc.
table_df.filter(table_df.column_name1 == 'abc').show()

Spark DataFrame turns empty after writing to table

I'm having some concerns regarding the behaviour of dataframes after writing them to Hive tables.
Context:
I run a Spark Scala (version 2.2.0.2.6.4.105-1) job through spark-submit in my production environment, which has Hadoop 2.
I do multiple computations and store some intermediate data to Hive ORC tables; after storing a table, I need to re-use the dataframe to compute a new dataframe to be stored in another Hive ORC table.
E.g.:
// dataframe with ~10 million records
val df = prev_df.filter(some_filters)
val df_temp_table_name = "temp_table"
val df_table_name = "table"
sql("SET hive.exec.dynamic.partition = true")
sql("SET hive.exec.dynamic.partition.mode = nonstrict")
df.createOrReplaceTempView(df_temp_table_name)
sql(s"""INSERT OVERWRITE TABLE $df_table_name PARTITION(partition_timestamp)
SELECT * FROM $df_temp_table_name """)
These steps always work and the table is properly populated with the correct data and partitions.
After this, I need to use the just computed dataframe (df) to update another table. So I query the table to be updated into dataframe df2, then I join df with df2, and the result of the join needs to overwrite the table of df2 (a plain, non-partitioned table).
val table_name_to_be_updated = "table2"
// Query the table to be updated
val df2 = sql(s"SELECT * FROM $table_name_to_be_updated")
val df3 = df.join(df2).filter(some_filters).withColumn(something)
val temp = "temp_table2"
df3.createOrReplaceTempView(temp)
sql(s"""INSERT OVERWRITE TABLE $table_name_to_be_updated
SELECT * FROM $temp """)
At this point, df3 always turns out empty, so the resulting Hive table is always empty as well. This also happens when I .persist() it to keep it in memory.
When testing with spark-shell, I have never encountered the issue. This happens only when the flow is scheduled in cluster-mode under Oozie.
What do you think might be the issue? Do you have any advice on approaching a problem like this with efficient memory usage?
I don't understand if it's the first df that turns empty after writing to a table, or if the issue is because I first query and then try to overwrite the same table.
Thank you very much in advance and have a great day!
Edit:
Previously, df was computed in one script and then inserted into its respective table. In a second script, that table was queried into a new variable df; the table_to_be_updated was also queried and stored into a variable, say old_df2. The two were then joined and transformed into a new variable df3, which was then written with INSERT OVERWRITE into table_to_be_updated.
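Not part of the original post, but one common explanation for this behaviour is lazy evaluation: df2 is only a plan that reads table2, so by the time the final INSERT OVERWRITE evaluates the join, table2 may already have been overwritten and the result comes back empty. A hedged sketch of one workaround, materializing the join with an eager checkpoint before the overwrite (the checkpoint directory and join key below are placeholders):
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // placeholder path on HDFS

val df2 = sql(s"SELECT * FROM $table_name_to_be_updated")
val df3 = df.join(df2, Seq("id"))   // "id" stands in for the real join condition
  .checkpoint()                     // eager by default: cuts the lineage back to table2

df3.createOrReplaceTempView("temp_table2")
sql(s"INSERT OVERWRITE TABLE $table_name_to_be_updated SELECT * FROM temp_table2")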

how to split a list to multiple partitions and sent to executors

When we use Spark to read data from CSV or a DB as follows, it automatically splits the data into multiple partitions and sends them to the executors:
spark
.read
.option("delimiter", ",")
.option("header", "true")
.option("mergeSchema", "true")
.option("codec", properties.getProperty("sparkCodeC"))
.format(properties.getProperty("fileFormat"))
.load(inputFile)
Currently, I have an id list such as:
[1,2,3,4,5,6,7,8,9,...1000]
What I want to do is split this list into multiple partitions and send them to the executors, and in each executor run SQL like:
ids.foreach(id => {
select * from table where id = id
})
When we load data from Cassandra, the connector generates the query as:
select columns from table where Token(k) >= ? and Token(k) <= ?
This means the connector will scan the whole table. But I don't need to scan the whole table; I just want to get the data where k (the partition key) is in the id list.
The table schema is:
CREATE TABLE IF NOT EXISTS tab.events (
k int,
o text,
event text,
PRIMARY KEY (k,o)
);
Or, how can I use Spark to load data from Cassandra with a predefined SQL statement without scanning the whole table?
You simply need to use the joinWithCassandraTable function to select only the data required for your operation. But be aware that this function is only available via the RDD API.
Something like this:
val joinWithRDD = your_df.rdd.joinWithCassandraTable("tab","events")
You need to make sure that the column name in your DataFrame matches the partition key name in Cassandra - see the documentation for more information.
The DataFrame implementation is only available in the DSE version of the Spark Cassandra Connector, as described in the following blog post.
Update (September 2020): support for joins with Cassandra was added in Spark Cassandra Connector 2.5.0.
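To tie this back to the id list in the question, a minimal sketch (keyspace, table and column names are taken from the schema above; the rest is assumed): parallelize the ids into an RDD and let the connector fetch only the matching partitions instead of scanning the table.
import com.datastax.spark.connector._   // provides joinWithCassandraTable on RDDs

val ids = (1 to 1000).toList
// One-field tuples so the single column lines up with the partition key k.
val idsRdd = sc.parallelize(ids).map(Tuple1(_))

val events = idsRdd
  .joinWithCassandraTable("tab", "events")
  .on(SomeColumns("k"))               // join on the partition key only
  .map { case (_, row) => (row.getInt("k"), row.getString("o"), row.getString("event")) }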