What is the correct way to write DStream data from Kafka to a SQL table like Postgres using Spark Streaming?
For example, I would have this:
val directKafkaStream = KafkaUtils.createDirectStream[..]
Using foreachRDD, I will map the data to a case class and do something like:
SELECT * FROM table WHERE id = id_from_kafka_rdd;
Then, with the result, I will do some other comparisons and decide whether to update the Postgres table with the data from Kafka. In effect, I might have to do operations like INSERT, UPDATE, etc. on the Postgres table.
What is the correct way to do this? Spark SQL, DataFrames or the JDBC connector method? I am a beginner to Spark.
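Roughly, the flow I have in mind looks like the sketch below (only a sketch; parse(), the connection details, and the table and column names are made-up placeholders):
import java.sql.DriverManager

case class Event(id: Long, value: String)

directKafkaStream.foreachRDD { rdd =>
  // parse() turning a Kafka record into an Event is hypothetical
  rdd.map(record => parse(record.value())).foreachPartition { events =>
    // one JDBC connection per partition, created on the executor
    val conn = DriverManager.getConnection("jdbc:postgresql://host:5432/db", "user", "password")
    try {
      val select = conn.prepareStatement("SELECT * FROM table WHERE id = ?")
      events.foreach { e =>
        select.setLong(1, e.id)
        val rs = select.executeQuery()
        if (rs.next()) {
          // compare columns and, if needed, run an UPDATE
        } else {
          // run an INSERT
        }
      }
    } finally {
      conn.close()
    }
  }
}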
Thanks in advance.
Related
I have a dataframe, and for each row I want to insert it into a Postgres database and get the generated primary key back into the dataframe. I can't find a good way to do this.
I tried with an RDD but it doesn't work (pg8000 get inserted id into dataframe).
I think it is possible with this process:
loop over dataframe.collect() to run the SQL inserts
run a SQL select to build a second dataframe
join the first dataframe with the second
But I think this is not optimized.
Do you have any idea?
I'm using PySpark in an AWS Glue job. Thanks.
The only things you can optimize are the data insertion and the connectivity.
As you mentioned, you have two operations in total: one is inserting the data, and the other is collecting the keys of the inserted data. Based on my understanding, neither Spark JDBC nor a Python connector like psycopg2 will return the primary keys of the rows you insert. Therefore, you need to do that part separately.
Back to your question:
You don't need a for loop to do the inserting, or .collect() to convert back to Python objects. You can use the Spark JDBC writer with the PostgreSQL driver to write the dataframe directly:
df \
    .write.mode('append').format('jdbc') \
    .option('driver', 'org.postgresql.Driver') \
    .option('url', url) \
    .option('dbtable', table_name) \
    .option('user', user) \
    .option('password', password) \
    .save()
I need a way to create a hive table from a Scala dataframe. The hive table should have underlying files in ORC format in S3 location partitioned by date.
Here is what I have got so far:
I write the Scala dataframe to S3 in ORC format:
df.write.format("orc").partitionBy("date").save("S3Location")
I can see the ORC files in the S3 location.
I then create a Hive table on top of these ORC files:
CREATE EXTERNAL TABLE "tableName"(columnName string)
PARTITIONED BY (date string)
STORED AS ORC
LOCATION "S3Location"
TBLPROPERTIES("orc.compress"="SNAPPY")
But the hive table is empty, i.e.
spark.sql("select * from db.tableName") prints no results.
However, when I remove the PARTITIONED BY line:
CREATE EXTERNAL TABLE tableName(columnName string, date string)
STORED AS ORC
LOCATION 'S3Location'
TBLPROPERTIES("orc.compress"="SNAPPY")
I see results from the select query.
It seems that Hive does not recognize the partitions created by Spark. I am using Spark 2.2.0.
Any suggestions will be appreciated.
Update:
I am starting with a Spark dataframe and I just need a way to create a Hive table on top of it (with the underlying files in ORC format in an S3 location).
I think the partitions have not been added to the Hive metastore yet, so you only need to run this Hive command:
MSCK REPAIR TABLE table_name
If that does not work, you may need to check these points:
After writing the data into S3, the folders should look like: s3://anypathyouwant/mytablefolder/transaction_date=2020-10-30
When creating the external table, the location should point to s3://anypathyouwant/mytablefolder
And yes, Spark writes the data into S3 but does not add the partition definitions to the Hive metastore! Hive is not aware of the written data unless it is under a recognized partition.
So to check which partitions are in the Hive metastore, you can use this Hive command:
SHOW PARTITIONS tablename
In a production environment, I do not recommend using MSCK REPAIR TABLE for this purpose, because it becomes more and more time-consuming as partitions accumulate. The best way is to make your code add only the newly created partitions to the metastore through its API.
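For example, a minimal sketch of registering a single newly written partition from Spark (the table name, partition value, and S3 path are hypothetical placeholders):
// Register one new partition in the Hive metastore after its ORC files
// have been written to S3.
spark.sql("ALTER TABLE db.tableName ADD IF NOT EXISTS PARTITION (date='2020-10-30') " +
  "LOCATION 's3://anypathyouwant/mytablefolder/date=2020-10-30'")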
Is there a way to drop a BigQuery table from Spark by using Scala?
I have only found ways to read and write a BigQuery table from Spark using Scala, from the example here:
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
Can someone provide an example of dropping a BigQuery table? For example, in the BigQuery console I can drop a table using the statement "drop table if exists projectid1.dataset1.table1".
Please note that my purpose in removing the existing table is NOT to overwrite it. I simply want to remove it. Please help. Thanks.
Please refer to the BigQuery API:
import com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.{BigQueryOptions, TableId}
// Uses the BigQuery client that ships (repackaged) inside the spark-bigquery connector
val bq = BigQueryOptions.getDefaultInstance().getService()
val table = bq.getTable(TableId.of("projectid1", "dataset1", "table1"))
if (table != null) {
  table.delete()
}
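Alternatively (assuming the same repackaged client as above), the service object itself exposes a delete method whose boolean result tells you whether the table existed, which matches the semantics of drop table if exists:
val deleted = bq.delete(TableId.of("projectid1", "dataset1", "table1"))
// deleted == false simply means the table did not exist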
Note that this should work out of the box on Dataproc. On other clusters you will need to set the credentials properly.
I have a scenario and would like to get an expert opinion on it.
I have to load a Hive table, in partitions, from a relational DB via Spark (Python). I cannot create the Hive table up front, as I am not sure how many columns there are in the source and they might change in the future, so I have to fetch the data using select * from tablename.
However, I am sure of the partition column and know that it will not change. This column has the "date" datatype in the source DB.
I am using saveAsTable with the partitionBy option, and I am able to properly create folders as per the partition column. The Hive table is also getting created.
The issue I am facing is that the partition column has the "date" data type, and that is not supported in Hive for partitions. Because of this, I am unable to read the data via Hive or Impala queries, as they say date is not supported as a partition column.
Please note that I cannot typecast the column at the time of issuing the select statement, as I have to do a select * from tablename and not select a, b, cast(c) as varchar from table.
I want to write to Cassandra from a dataframe, and I want to exclude rows that already exist (i.e. by primary key; even though upserts would happen, I don't want to change the other columns), using the spark-cassandra-connector. Is there a way we can do that?
Thanks!
You can use the ifNotExists WriteConf option, which was introduced in this PR.
It works like so:
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.WriteConf

val writeConf = WriteConf(ifNotExists = true)
rdd.saveToCassandra(keyspaceName, tableName, writeConf = writeConf)
You can do
sparkConf.set("spark.cassandra.output.ifNotExists", "true")
With this config:
if the partition key and clustering columns match a row that already exists in Cassandra, the write will be ignored
otherwise, the write will be performed
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/insert_r.html#reference_ds_gp2_1jp_xj__if-not-exists
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters
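Since the question mentions a dataframe rather than an RDD, here is a minimal sketch of combining that setting with the dataframe writer (assuming the connector picks up this write-tuning parameter for dataframe writes as well; the keyspace, table, and data are placeholders):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-if-not-exists")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  // skip rows whose primary key already exists instead of upserting them
  .config("spark.cassandra.output.ifNotExists", "true")
  .getOrCreate()

import spark.implicits._
val df = Seq((1, "a"), (2, "b")).toDF("id", "value") // placeholder data

df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .mode("append")
  .save()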
Srinu, this all boils down to "read before write" no matter whether you are using Spark or not.
But there is the IF NOT EXISTS clause:
If the column exists, it is updated. The row is created if none exists. Use IF NOT EXISTS to perform the insertion only if the row does not already exist. Using IF NOT EXISTS incurs a performance hit associated with using Paxos internally. For information about Paxos, see Cassandra 2.1 documentation or Cassandra 2.0 documentation.
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/insert_r.html