Delete a BigQuery table from Spark using Scala

Is there a way to drop a BigQuery table from Spark by using Scala?
I have only found ways to read and write BigQuery tables from Spark using Scala, following the example here:
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
Can someone provide an example of dropping a BigQuery table? For example, in the BigQuery console I can drop a table with the statement "drop table if exists projectid1.dataset1.table1".
Please note that my purpose in removing the existing table is NOT to overwrite it; I simply want to remove it. Please help. Thanks.

Please refer to the BigQuery API:
import com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.{BigQueryOptions, TableId}
val bq = BigQueryOptions.getDefaultInstance().getService()
val table = bq.getTable(TableId.of("projectid1", "dataset1", "table1"))
if (table != null) {
  table.delete()
}
Note that this should work out of the box on Dataproc. On other clusters you will need to set up the credentials properly.
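If you prefer a single call, the BigQuery client also exposes delete(TableId), which returns false when the table does not exist, so it behaves like "drop table if exists". A minimal sketch, reusing the project, dataset and table names from the question:
import com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.{BigQueryOptions, TableId}

val bq = BigQueryOptions.getDefaultInstance().getService()

// delete() returns true if the table was removed and false if it did not exist,
// so no separate existence check is needed.
val deleted = bq.delete(TableId.of("projectid1", "dataset1", "table1"))
println(s"table removed: $deleted")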

Related

PySpark - insert generated primary key into dataframe

I have a dataframe and, for each row, I want to insert it into a Postgres database and get the generated primary key back into the dataframe. I can't find a good way to do this.
I have tried with an RDD but it doesn't work (pg8000 get inserted id into dataframe).
I think it is possible with this process:
loop over dataframe.collect() to run the SQL inserts
run a SQL SELECT into a second dataframe
join the first dataframe with the second
But I think this is not optimal.
Do you have any ideas?
I'm using PySpark in an AWS Glue job. Thanks.
The only things that you can optimize are the data inserting and the connectivity.
As you mentioned, you have two operations: inserting the data, and then collecting what was inserted. To my understanding, neither Spark JDBC nor a Python connector like psycopg2 will return the primary keys of the rows you inserted, so you have to fetch them separately.
Back to your question:
You don't need a for loop over .collect() to do the inserting, or to convert back to Python objects. You can write the dataframe directly with the Spark JDBC data source and the PostgreSQL driver:
df \
    .write.mode('append').format('jdbc') \
    .option('driver', 'org.postgresql.Driver') \
    .option('url', url) \
    .option('dbtable', table_name) \
    .option('user', user) \
    .option('password', password) \
    .save()
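The separate read-back step could then look like the sketch below, shown with the Scala DataFrame API (PySpark takes exactly the same options). It assumes the table has a natural/business key column, called natural_key here, next to the generated id column:
// Read the table back after the append, keeping only the generated key and the
// natural key that also exists in the original dataframe (both names assumed).
val inserted = spark.read
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", url)
  .option("dbtable", table_name)
  .option("user", user)
  .option("password", password)
  .load()
  .select("id", "natural_key")

// Attach the generated ids to the original rows by joining on the natural key.
val dfWithIds = df.join(inserted, Seq("natural_key"), "left")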

How to create a new table with the results of SHOW TABLES in Databricks SQL?

I want to do aggregations on the result of
SHOW TABLES FROM databasename
Or create a new table with the result like
CREATE TABLE database.new_table AS (
SHOW TABLES FROM database
);
But I'm getting multiple different errors if I try to do anything else with SHOW TABLES.
Is there another way of doing anything with the result of SHOW TABLES, or another way of creating a table with all the column names in a database? I have previously worked with Teradata, where this is quite easy.
Edit: I only have access to Databricks SQL Analytics, so I can only write pure SQL.
Another way of doing it:
spark.sql("use " + databasename)
df = spark.sql("show tables")
df.write.saveAsTable('databasename.new_table')
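Once the result is saved, aggregations on it are ordinary SQL; for example, counting tables by the isTemporary flag (SHOW TABLES usually returns database, tableName and isTemporary, although the first column is called namespace in some Spark versions):
spark.sql("""
  SELECT isTemporary, COUNT(*) AS table_count
  FROM databasename.new_table
  GROUP BY isTemporary
""").show()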

Hive create partitioned table based on Spark temporary table

I have a Spark temporary table spark_tmp_view with a DATE_KEY column. I am trying to create a Hive table from it (without writing the temp table out to a Parquet location first). What I have tried to run is spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.result AS SELECT * FROM spark_tmp_view PARTITIONED BY(DATE_KEY DATE)")
The error I got is mismatched input 'BY' expecting <EOF>. I have searched but still haven't been able to figure out how to do this from a Spark app, or how to insert data afterwards. Could someone please help? Many thanks.
PARTITIONED BY is part of the definition of the table being created, so it has to precede ...AS SELECT...; see the Spark SQL CREATE TABLE syntax.
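A sketch of a corrected statement, assuming a data-source (USING PARQUET) table is acceptable; a true EXTERNAL table would additionally need a LOCATION clause, and Hive-format CTAS has extra restrictions around partition columns:
// PARTITIONED BY moves before AS SELECT, and the partition column is listed by
// name only; its DATE type is taken from spark_tmp_view.
spark.sql("""
  CREATE TABLE IF NOT EXISTS mydb.result
  USING PARQUET
  PARTITIONED BY (DATE_KEY)
  AS SELECT * FROM spark_tmp_view
""")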

spark-cassandra-connector: add column on the fly

The following cqlsh command adds a column to an already existing Cassandra table.
cqlsh:demodb> ALTER TABLE users ADD coupon_code varchar;
How would I do the same with the scala spark-cassandra-connector?
I am not seeing a reference to this in the documentation.
Also: is there a Scaladoc for com.datastax.spark.connector?
You can use the withSessionDo method of the CassandraConnector, like this:
import com.datastax.spark.connector.cql.CassandraConnector
CassandraConnector(conf).withSessionDo { session =>
  session.execute("ALTER TABLE users ADD coupon_code varchar;")
}
See more examples in the documentation for the Spark Cassandra Connector.
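Here conf is the SparkConf of the running application, which must carry spark.cassandra.connection.host. A slightly fuller sketch (the contact point is an assumption, and the table is qualified with the demodb keyspace from the cqlsh prompt because the connector session has no default keyspace):
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("add-column-example")
  .config("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
  .getOrCreate()

CassandraConnector(spark.sparkContext.getConf).withSessionDo { session =>
  // Fully qualify the table because this session has no default keyspace.
  session.execute("ALTER TABLE demodb.users ADD coupon_code varchar;")
}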

Spark Streaming: Write to PSQL Table from Kafka

What is the correct way to write DStream data from Kafka to a SQL table such as Postgres using Spark Streaming?
For ex. I would have this,
val directKafkaStream = KafkaUtils.createDirectStream[..]
Using foreachRDD, I will map the data to a case class and do something like
SELECT * FROM table WHERE id = id_from_kafka_rdd;
Then, with the result of this, I will do some other comparisons and decide whether to update the Postgres table with the data from Kafka. In effect, I might have to run operations like INSERT, UPDATE, etc. on the Postgres table.
What is the correct way to do this: Spark SQL, DataFrames, or the JDBC connector method? I am a beginner with Spark.
Thanks in advance.
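For reference, a minimal sketch of the approach described above, using plain JDBC inside foreachRDD with one connection per partition. The Record case class, the records table layout, and the connection URL and credentials are all assumptions:
import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

case class Record(id: Long, value: String)

// recordStream is the DStream[Record] obtained by mapping directKafkaStream,
// as described in the question.
def writeToPostgres(recordStream: DStream[Record]): Unit = {
  recordStream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // Open one connection per partition rather than per record.
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://dbhost:5432/mydb", "user", "password") // assumed URL/credentials
      try {
        val stmt = conn.prepareStatement(
          "INSERT INTO records (id, value) VALUES (?, ?) " +
          "ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value")
        partition.foreach { r =>
          stmt.setLong(1, r.id)
          stmt.setString(2, r.value)
          stmt.executeUpdate()
        }
      } finally {
        conn.close()
      }
    }
  }
}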