I typically use the code below to write a PySpark DataFrame into a Hive table. I have a column pxn_dt that will be used to partition the table.
How can I modify the code so that it adds new partitions to the table (for the new month) the next time I run the script?
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df.createOrReplaceTempView("mytempTable")
spark.sql("create table my_db.table as select * from mytempTable")
I tried the line below instead, but it doesn't seem to work.
spark.sql("create table my_db.table partitioned by (pxn_dt) as select * from mytempTable")
I am trying to retrieve data from a Hive database with Spark, and even though there is data in the database (I checked with Hive), a query run from Spark returns no rows (it does return the column information).
I have copied the hive-site.xml file into the Spark configuration folder (as required).
Imports:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.HiveContext
Creating a Spark session:
val spark = SparkSession.builder().appName("Reto").config("spark.sql.warehouse.dir", "hive_warehouse_hdfs_path").enableHiveSupport().getOrCreate()
spark.sql("show databases").show()
Getting data:
spark.sql("USE retoiabd")
val churn = spark.sql("SELECT count(*) FROM churn").show()
Output:
count(1) = 0
After checking with our teacher, it turned out there was an issue with how the tables themselves were created in Hive.
We created the table like this:
CREATE TABLE name (columns)
Instead of like this:
CREATE EXTERNAL TABLE name (columns)
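For illustration, a hedged sketch of the external variant; the column definitions, row format, and HDFS location below are placeholders rather than the actual schema:
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS retoiabd.churn (
    customer_id STRING,
    churned     INT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/hive/external/retoiabd/churn'
""")
Because the table is EXTERNAL with an explicit LOCATION, dropping and recreating its definition leaves the underlying files in place, and Hive and Spark both read the same HDFS directory through the shared metastore.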
How can I transform this from SQL to PySpark without using pyspark.sql or the PySpark function expr? I'm using a cross join between the tables.
SUM(CASE WHEN (s.Date BETWEEN d.StartDate AND d.EndDate) AS SalesSum
I've tried the PySpark .between() and .agg() for the sum, but I'm referencing columns from multiple tables in the cross join and haven't been able to recreate the query in PySpark.
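A minimal sketch of one way to express this, assuming the two sides of the cross join are aliased as s and d and that a Sales column is the measure being summed (those names are assumptions, not given in the question): when() combined with Column.between() reproduces the CASE WHEN, and qualifying columns through the aliases keeps the two tables apart.
from pyspark.sql import functions as F
s = sales_df.alias("s")   # hypothetical DataFrame holding Date and Sales
d = dates_df.alias("d")   # hypothetical DataFrame holding StartDate and EndDate
result = (
    s.crossJoin(d)
     .groupBy("d.StartDate", "d.EndDate")
     .agg(
         F.sum(
             F.when(
                 F.col("s.Date").between(F.col("d.StartDate"), F.col("d.EndDate")),
                 F.col("s.Sales")  # rows outside the range stay NULL and are ignored by sum
             )
         ).alias("SalesSum")
     )
)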
I'm trying to update a Delta Lake table using a Spark DataFrame. What I want to do is update all rows in the Delta table that differ from the Spark DataFrame, and insert all rows that are missing from the Delta table.
I tried to do this as follows:
import io.delta.tables._
val not_equal_string = df.schema.fieldNames.map(fn =>
s"coalesce(not ((updates.${fn} = history.${fn}) or (isnull(history.${fn}) and isnull(updates.${fn})) ),false)"
).reduceLeft((x,y) => s"$x OR $y ")
val deltaTable = DeltaTable.forPath(spark, "s3a://sparkdata/delta-table")
deltaTable.as("history").merge(
df.as("updates"), "updates.EquipmentKey = history.EquipmentKey"
).whenMatched(not_equal_string).updateAll().whenNotMatched().insertAll().execute()
This works, but when I look at the resulting Delta table I see that it effectively doubled in size even though not a single record was updated. A new JSON log file was generated with a remove for every old partition and an add with all the new partitions.
When I just run a SQL join with the whenMatched criterion as a WHERE condition, I don't get a single row.
I would expect the Delta table to be untouched after such a merge operation. Am I missing something simple?
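One mitigation suggested in the Delta Lake documentation is to narrow the merge's ON condition with a predicate on the target table, so that only the partitions and files that could actually contain matches become candidates for rewriting; this is only safe when the incoming updates are known to satisfy the same predicate, otherwise unmatched rows would be re-inserted as duplicates. A hedged sketch in PySpark (the Scala API is analogous; load_date and the cutoff value are hypothetical):
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "s3a://sparkdata/delta-table")
not_equal = " OR ".join(
    f"coalesce(not ((updates.{c} = history.{c}) or "
    f"(isnull(history.{c}) and isnull(updates.{c}))), false)"
    for c in df.columns
)
(delta_table.alias("history")
    .merge(df.alias("updates"),
           "updates.EquipmentKey = history.EquipmentKey "
           "AND history.load_date >= '2020-01-01'")  # hypothetical partition-pruning predicate
    .whenMatchedUpdateAll(condition=not_equal)       # only rows that actually changed are updated
    .whenNotMatchedInsertAll()
    .execute())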
I'm having some concerns regarding the behaviour of dataframes after writing them to Hive tables.
Context:
I run a Spark Scala (version 2.2.0.2.6.4.105-1) job through spark-submit in my production environment, which has Hadoop 2.
I do multiple computations and store some intermediate data to Hive ORC tables; after storing a table, I need to re-use the dataframe to compute a new dataframe to be stored in another Hive ORC table.
E.g.:
// dataframe with ~10 million records
val df = prev_df.filter(some_filters)
val df_temp_table_name = "temp_table"
val df_table_name = "table"
sql("SET hive.exec.dynamic.partition = true")
sql("SET hive.exec.dynamic.partition.mode = nonstrict")
df.createOrReplaceTempView(df_temp_table_name)
sql(s"""INSERT OVERWRITE TABLE $df_table_name PARTITION(partition_timestamp)
SELECT * FROM $df_temp_table_name """)
These steps always work and the table is properly populated with the correct data and partitions.
After this, I need to use the just computed dataframe (df) to update another table. So I query the table to be updated into dataframe df2, then I join df with df2, and the result of the join needs to overwrite the table of df2 (a plain, non-partitioned table).
val table_name_to_be_updated = "table2"
// Query the table to be updated
val df2 = sql(s"SELECT * FROM $table_name_to_be_updated")
val df3 = df.join(df2).filter(some_filters).withColumn(something)
val temp = "temp_table2"
df3.createOrReplaceTempView(temp)
sql(s"""INSERT OVERWRITE TABLE $table_name_to_be_updated
SELECT * FROM $temp """)
At this point, df3 always turns out to be empty, so the resulting Hive table is empty as well. This also happens when I .persist() it to keep it in memory.
When testing with spark-shell, I have never encountered the issue. This happens only when the flow is scheduled in cluster-mode under Oozie.
What do you think might be the issue? Do you have any advice on approaching a problem like this with efficient memory usage?
I don't understand if it's the first df that turns empty after writing to a table, or if the issue is because I first query and then try to overwrite the same table.
Thank you very much in advance and have a great day!
Edit:
Previously, df was computed in an individual script and then inserted into its respective table. In a second script, that table was queried into a new variable df; then table_to_be_updated was also queried and stored into a variable, call it old_df2. The two were then joined and computed upon in a new variable df3, which was then inserted with overwrite into table_to_be_updated.
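One likely cause is that the table being overwritten is also an input to the (lazy) plan that produces df3: by the time the INSERT OVERWRITE executes, the table's previous contents may already have been cleared, so the join evaluates against empty data. A hedged workaround sketch, shown in PySpark for brevity (the Scala calls are the same; the checkpoint directory and join key are placeholders): materialize the join result before overwriting its source table.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # assumed HDFS/local path
df2 = spark.table("table2")
df3 = df.join(df2, "join_key")  # hypothetical join key; filters/columns as in the question
# checkpoint(eager=True) computes df3 immediately and cuts its lineage,
# so the INSERT OVERWRITE below can no longer invalidate the data df3 was built from
df3_materialized = df3.checkpoint(eager=True)
df3_materialized.createOrReplaceTempView("temp_table2")
spark.sql("INSERT OVERWRITE TABLE table2 SELECT * FROM temp_table2")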
I have a local single-node Hadoop and Hive installation, with some Hive tables stored in HDFS. I then configured Hive with a MySQL metastore. Now I have installed Spark and I'm running some queries over the Hive tables like this (in Scala):
var hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hiveContext.sql("SELECT * FROM USERS")
result.show
Do you know how to configure Spark to show the execution time of the query? By default it is not shown.
Use spark.time(), which runs the expression and prints the time taken to execute it to the console:
var hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hiveContext.sql("SELECT * FROM USERS")
spark.time(result.show)
https://db-blog.web.cern.ch/blog/luca-canali/2017-03-measuring-apache-spark-workload-metrics-performance-troubleshooting