Check the metastore for table availability in Spark - Scala

I want to check whether a table exists in a specific Hive database before running the select query below.
How can I get this information from the metastore?
sparkSession.sql("select * from db_name.table_name")

You can run the commands below before running your operation on the table:
sparkSession.sql("use databaseName")
val df = sparkSession.sql("show tables like 'tableName'")
if (!df.head(1).isEmpty) {
  // the table exists, so it is safe to query it here
}
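On Spark 2.1 and later, the catalog API should give a more direct check; a minimal sketch, assuming the db_name and table_name from the question:
// true if db_name.table_name is registered in the metastore
if (sparkSession.catalog.tableExists("db_name", "table_name")) {
  val df = sparkSession.sql("select * from db_name.table_name")
}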

Try this:
# This gives the list of tables in the database as a DataFrame.
tables = sparkSession.sql('SHOW TABLES IN ' + db_name)
# Collect them to get a list of Row objects
tables_list = tables.select('tableName').collect()
# Convert them into a Python list of table names
tables_list_arr = []
for table in tables_list:
    tables_list_arr.append(str(table['tableName']))
# Then execute your command only if the table you want (table_name) exists
if table_name in tables_list_arr:
    sparkSession.sql("select * from db_name.table_name")

Related

Create clickhouse temporary table with doobie

I want to create a ClickHouse temporary table with doobie, but I don't know how to add the session_id parameter to the query.
I tried:
val sql =
  sql"CREATE TEMPORARY TABLE " ++ Fragment.const(tableName) ++ sql"( call_id String ) ENGINE Memory()"
HC.setClientInfo("session_id", sessionId.toString) *> sql.update.run
but this is not working.

Execute SQL stored in dataframe using pyspark

I have a list of SQL statements stored in a Hive table column. I have to fetch one statement at a time from the Hive table and execute it. I get the SQL as a DataFrame, but how do I execute the SQL that is stored in the DataFrame?
The column parameter_value contains the SQL:
extract_sql = spark.sql(""" select parameter_value
    from schema_name.table_params
    where project_name = 'some_projectname'
    and sub_project_name = 'some_sub_project'
    and parameter_name = 'extract_sql' """)
Now extract_sql contains the SQL; how do I execute it?
You can do it as follows:
sqls = spark.sql(""" select parameter_value
    from schema_name.table_params
    where project_name = 'some_projectname'
    and sub_project_name = 'some_sub_project'
    and parameter_name = 'extract_sql' """).collect()
for sql in sqls:
    spark.sql(sql[0]).show()
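Since this thread is otherwise Scala-oriented, the same idea in Scala might look like the sketch below (assuming the same schema_name.table_params table and that each parameter_value row holds a single runnable statement):
// Collect the stored statements, then run each one through the same SparkSession
val sqls = spark.sql(
  """select parameter_value
     from schema_name.table_params
     where project_name = 'some_projectname'
     and sub_project_name = 'some_sub_project'
     and parameter_name = 'extract_sql'""").collect()
sqls.foreach(row => spark.sql(row.getString(0)).show())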

DB2 Update statement not working using JDBC

I have a few rows stored in a source table (referred to as $schema.$sourceTable in the UPDATE query below). This table has 3 columns: TABLE_NAME, PERMISSION_TAG_COL, PT_DEPLOYED.
I have an update statement stored in a string like:
var update_PT_Deploy = s"UPDATE $schema.$sourceTable SET PT_DEPLOYED = 'Y' WHERE TABLE_NAME = '$tableName';"
My source table does have rows whose TABLE_NAME matches $tableName (the parameter), because I inserted those rows using another function of my program. The default value of PT_DEPLOYED when I inserted the rows was NULL.
I'm trying to execute the update using JDBC in the following manner:
println(update_PT_Deploy)
val preparedStatement: PreparedStatement = connection.prepareStatement(update_PT_Deploy)
val row = preparedStatement.execute()
println(row)
println("row updated in table successfully")
preparedStatement.close()
The above piece of code does not throw any exception, but when I query my table in a tool like DBeaver, the NULL value of PT_DEPLOYED does not get updated to Y.
If I execute the same query as mentioned in update_PT_Deploy inside DBeaver, the query works and the table updates. I am sure I am following the correct steps.
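One thing worth checking, although it is only a guess from the snippet: if the JDBC connection was opened with auto-commit disabled, the update will not be visible to other tools such as DBeaver until it is committed. A minimal sketch along those lines, reusing connection and update_PT_Deploy from above (and with the trailing semicolon dropped, since it is usually safer to omit it inside a prepared statement):
val preparedStatement: PreparedStatement = connection.prepareStatement(update_PT_Deploy)
val rowsUpdated = preparedStatement.executeUpdate() // number of rows actually changed
println(s"$rowsUpdated row(s) updated")
if (!connection.getAutoCommit) connection.commit()  // flush the change if auto-commit is off
preparedStatement.close()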

How to delete data from Hive external table for Non-Partition column?

I have created an external table in Hive partitioned by client and month.
The requirement asks to delete the data for ID=201 from that table, but it's not partitioned by the ID column.
I have tried to do it with INSERT OVERWRITE, but it's not working.
We are using Spark 2.2.0.
How can I solve this problem?
val sqlDF = spark.sql("select * from db.table")
val newSqlDF1 = sqlDF.filter(!col("ID").isin("201") && col("month").isin("062016"))
val columns = newSqlDF1.schema.fieldNames.mkString(",")
newSqlDF1.createOrReplaceTempView("myTempTable")
spark.sql(s"INSERT OVERWRITE TABLE db.table PARTITION(client, month) select ${columns} from myTempTable")

Load multiple .csv files into one table and create an ID per .csv - Postgres

I am using PostgreSQL 9.5 and I am stuck on a problem.
I have multiple .csv files (40), and all of them have the same column count and column names. I would now like to import them into one table, but I want an ID per .csv file. Is it possible to automate this in Postgres, including adding a new ID column? And how?
The approach might look like this:
test1.csv ==> table_agg ==> set ID = 1
test2.csv ==> table_agg ==> set ID = 2
.
.
.
test40.csv ==> table_agg ==> set ID = 40
I would be very glad if someone could help me
Add a table that contains the filename and other info you would like to keep for each dataset. Add a serial column that you can use as a foreign key in your data table, i.e. a dataset identifier.
Create the data table. Add a foreign key field to refer to the dataset entry in the other table.
Use a Python script to parse and import the csv files into the database. First add the entry to the datasets table. Then determine the dataset ID and insert the rows into the data table with the corresponding dataset ID set.
My simple solution assigns an ID to each .csv file in Python and writes all .csv files out as one:
import glob, os, pandas as pd
path = r'PathToFolder'
# all .csv files in this folder
allFiles = glob.glob(path + "/*.csv")
# save the DataFrames in list_
list_ = []
# DataFrame for the later concat
frame = pd.DataFrame()
# ID per DataFrame/.csv
count = 0
for file_ in allFiles:
    # read the .csv file
    df = pd.read_csv(file_, index_col=None, skiprows=[1], header=0)
    # new column with the ID of this DataFrame
    df['new_id'] = count
    list_.append(df)
    count = count + 1
frame = pd.concat(list_)
frame.to_csv('PathToOutputCSV', index=False)
Continue with SQL:
CREATE TABLE statement..
COPY TABLE_NAME FROM 'PathToCSV' DELIMITER ',' CSV HEADER;