I want to insert the value of a column (stores) into a Postgres jsonb column. I'm reading this data from a CSV file, and it looks like this when I read it:
+------------------------------------+
|stores |
+------------------------------------+
|[28, 29, 32, 33, 35, 36, 37, 38, 39]|
|[28, 29, 32, 33, 35, 36, 37, 38, 39]|
+------------------------------------+
The schema is like this:
root
|-- stores: string (nullable = true)
Inserting this column into a Postgres jsonb column gives the error below:
ERROR: column "stores" is of type jsonb but expression is of type character varying
Hint: You will need to rewrite or cast the expression.
So I tried to change the column to json like this:
.withColumn("stores", to_json(split(regexp_replace(col("stores"), '\\[|\\]', ''), ",")))
I also tried this, which gives an error:
.withColumn("stores", to_json(col("stores")))
cannot resolve 'to_json(stores)' due to data type mismatch: Input type string must be a struct, array of structs or a map or array of map
Then I tried to change the schema and use ArrayType, which failed with an error saying that type is not supported by the CSV source.
This is the reading and inserting code.
df = spark.read \
    .option("header", True) \
    .format("csv") \
    .load(path) \
    .withColumn("stores", to_json(split(regexp_replace(col("stores"), '\\[|\\]', ''), ",")))

df.select(col("stores")) \
    .write \
    .format("jdbc") \
    .option("url", db_url) \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", target_table) \
    .option("user", db_user) \
    .option("password", db_password) \
    .mode("append") \
    .save()
How should I do that?
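One workaround often used for this (not from the original post, so treat it as a sketch): the stores string is already valid JSON array text, so you can keep it as a plain string and add stringtype=unspecified to the JDBC URL; the Postgres driver then sends strings as untyped parameters and the server casts them to jsonb itself. A minimal sketch, assuming db_url has no query parameters yet:

from pyspark.sql.functions import col

# Read the CSV; "stores" arrives as the string "[28, 29, ...]",
# which is already valid JSON array text, so no conversion is needed.
df = spark.read \
    .option("header", True) \
    .format("csv") \
    .load(path)

# stringtype=unspecified makes the Postgres JDBC driver send strings as
# untyped parameters, so the server coerces them to jsonb.
# (Assumes db_url has no existing "?..." query string; otherwise append with "&".)
df.select(col("stores")) \
    .write \
    .format("jdbc") \
    .option("url", db_url + "?stringtype=unspecified") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", target_table) \
    .option("user", db_user) \
    .option("password", db_password) \
    .mode("append") \
    .save()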
Related
I'm trying to create a function that will accept any number of DataFrame columns and return whether they contain duplicates or not.
Something like this:
def find_duplicates(df, *args):
    df \
        .groupby([args]) \
        .count() \
        .where('count > 1') \
        .sort('count', ascending=False) \
        .show()
So I want to call the function with a DataFrame and a list of the columns:
find_duplicates(df, 'col', 'col2', 'col3')
Is there any simple way?
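Not from the original post, but one minimal way to make this work (assuming each argument is a column-name string) is to unpack the varargs straight into groupBy:

def find_duplicates(df, *cols):
    # groupBy accepts column names as varargs, so unpack them directly
    df.groupBy(*cols) \
        .count() \
        .where('count > 1') \
        .sort('count', ascending=False) \
        .show()

# usage
find_duplicates(df, 'col', 'col2', 'col3')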
def QueryDB(sqlQuery):
    jdbcUrl = mssparkutils.credentials.getSecret("param1","DBJDBCConntring","param3")
    spark = SparkSession.builder.appName("show results").getOrCreate()
    jdbcdf = (spark.read.format("jdbc")
        .option("url", jdbcUrl)
        .option("query", sqlQuery)
        .load()
    )
    return jdbcdf

df = QueryDB("INSERT INTO schema.table1 (column1, column2) output inserted.column1 values('one', 'two')")
df.show()
The notebook runs without any error, but no rows are inserted. Any suggestions or sample code for inserting into the table?
spark.read.format("jdbc") is to read JDBC. If you want to insert data to JDBC you'd want something like this
jdbcDF.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .save()
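One caveat (not in the original answer): the batch writer defaults to the errorifexists save mode, so if schema.tablename already exists you will likely also want .mode("append") before .save().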
I have a streaming DataFrame from Kafka and I need to pivot two columns. This is the code I'm currently using:
streaming_df = streaming_df.groupBy('Id', 'Date') \
    .pivot('Var') \
    .agg(first('Val'))

query = streaming_df.limit(5) \
    .writeStream \
    .outputMode("append") \
    .format("memory") \
    .queryName("stream") \
    .start()

time.sleep(50)
spark.sql("select * from stream").show(20, False)
query.stop()
I receive the following error:
pyspark.sql.utils.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
PySpark version: 3.1.1
Any ideas how to implement pivot with a streaming DataFrame?
The pivot transformation is not supported by Spark when applied to streaming data.
What you can do is use foreachBatch with a user-defined function, like this:
def apply_pivot(stream_df, batch_id):
    # Inside foreachBatch, stream_df is a plain batch DataFrame for this
    # micro-batch, so the pivot is allowed here. The batch writer has no
    # memory sink (and no outputMode/queryName), so append the pivoted
    # result to a table that spark.sql can query instead.
    stream_df \
        .groupBy('Id', 'Date') \
        .pivot('Var') \
        .agg(first('Val')) \
        .write \
        .mode('append') \
        .saveAsTable('stream')

query = streaming_df.limit(5) \
    .writeStream \
    .foreachBatch(apply_pivot) \
    .start()

time.sleep(50)
spark.sql("select * from stream").show(20, False)
query.stop()
Let me know if it helped you!
I am receiving a Kafka stream in PySpark. Currently I am grouping it by one set of fields and writing updates to the database:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
    .option("subscribe", topic)
...

df = df \
    .groupBy("myfield1") \
    .agg(
        expr("count(*) as cnt"),
        min(struct(col("mycol.myfield").alias("mmm"), col("*"))).alias("minData")
    ) \
    .select("cnt", "minData.*") \
    .select(
        col("...").alias("..."),
        ...
        col("userId").alias("user_id")
    )

query = df \
    .writeStream \
    .outputMode("update") \
    .foreachBatch(lambda df, epoch: write_data_frame(table_name, df, epoch)) \
    .start()

query.awaitTermination()
Can I tap into the same chain in the middle and create another grouping, like
df2 = df \
    .groupBy("myfield2") \
    .agg(
        expr("count(*) as cnt"),
        min(struct(col("mycol.myfield").alias("mmm"), col("*"))).alias("minData")
    ) \
    .select("cnt", "minData.*") \
    .select(
        col("...").alias("..."),
        ...
        col("userId").alias("user_id")
    )
and write its output to a different place in parallel?
Where should I call writeStream and awaitTermination?
Yes, you can branch a Kafka input stream into as many streaming queries as you like.
You need to consider the following:
query.awaitTermination is a blocking method, which means that any code you write after it will not be executed until that query terminates.
Each "branched" streaming query will run in parallel, and it is important that you define a checkpoint location in each of their writeStream calls.
Overall, your code needs to have the following structure:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
    .option("subscribe", topic) \
    .[...]

# note that I changed the variable name to "df1"
df1 = df \
    .groupBy("myfield1") \
    .[...]

df2 = df \
    .groupBy("myfield2") \
    .[...]

query1 = df1 \
    .writeStream \
    .outputMode("update") \
    .option("checkpointLocation", "/tmp/checkpointLoc1") \
    .foreachBatch(lambda batch_df, epoch: write_data_frame(table_name, batch_df, epoch)) \
    .start()

query2 = df2 \
    .writeStream \
    .outputMode("update") \
    .option("checkpointLocation", "/tmp/checkpointLoc2") \
    .foreachBatch(lambda batch_df, epoch: write_data_frame(table_name, batch_df, epoch)) \
    .start()

spark.streams.awaitAnyTermination()
Just an additional remark: in the code you are showing, you are overwriting df, so the derivation of df2 might not give you the results you intended.
I am trying to truncate an Oracle table using PySpark with the code below:
truncatesql = """ truncate table mytable """
mape = spark.read \
    .format("jdbc") \
    .option("url", DB_URL) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", truncatesql) \
    .load()
but it keeps throwing java.sql.SQLSyntaxErrorException: ORA-00933: SQL command not properly ended. How can I truncate a table using a direct SQL query?
Try wrapping your query with an alias.
Example:
truncatesql = """(truncate table mytable)e"""
mape = spark.read \
    .format("jdbc") \
    .option("url", DB_URL) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", truncatesql) \
    .load()
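If the alias trick still raises ORA-00933 (the dbtable option gets wrapped in a SELECT, and TRUNCATE is DDL rather than a query), a pattern sometimes used instead is to execute the statement over a plain JDBC connection through Spark's JVM gateway. A rough sketch, assuming the Oracle JDBC jar is on the driver classpath and that db_user / db_password hold the credentials (those two names are made up here):

# Open a raw JDBC connection via py4j and run the DDL directly.
jvm = spark.sparkContext._jvm
conn = jvm.java.sql.DriverManager.getConnection(DB_URL, db_user, db_password)
try:
    stmt = conn.createStatement()
    stmt.executeUpdate("truncate table mytable")
    stmt.close()
finally:
    conn.close()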