Pyspark streaming error with field concatenation - pyspark

I am new to Spark Streaming. I have created a Spark Structured Streaming application using PySpark:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("subscribe", f"{main_topic}") \
    .load()
userExpDF1 = df.selectExpr("substring(value, 6) as avro_value").where("topic='mainTopic'") \
    .select(from_avro("avro_value", user_exp_schema).alias("record")) \
    .selectExpr("concat(record.empId, record.accountId)", "record.accountId", "record.empId", "record.salary", "record.age")
I am trying to concatenate two fields, but I get a weird error (full stack trace below). It happens only when I include the original column names along with the concatenated column: with "concat(record.empId, record.accountId)", "record.accountId" and "record.empId" all present in the selectExpr the query fails, but when I remove "record.accountId" and "record.empId" the concatenation works. I can also select just the accountId and empId fields without the concatenation and that works too, so it is having the original fields together with the concatenated field that triggers the issue. I tried renaming the fields and the same thing happens. I also tried .select() instead of selectExpr() and got the same error.
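For reference, here is a sketch of that failing .select() variant (concat and col come from pyspark.sql.functions; the empAccountId alias is just for illustration and is not in my real code):

from pyspark.sql.functions import col, concat

# Same projection as the selectExpr version above, written with .select()
userExpDF1 = df.selectExpr("substring(value, 6) as avro_value").where("topic='mainTopic'") \
    .select(from_avro("avro_value", user_exp_schema).alias("record")) \
    .select(concat(col("record.empId"), col("record.accountId")).alias("empAccountId"),
            col("record.accountId"), col("record.empId"), col("record.salary"), col("record.age"))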
22/07/05 19:05:17 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 30, Column 70: IDENTIFIER expected instead of '['
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 30, Column 70: IDENTIFIER expected instead of '['
at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:196)
at org.codehaus.janino.Parser.read(Parser.java:3705)
at org.codehaus.janino.Parser.parseQualifiedIdentifier(Parser.java:446)
at org.codehaus.janino.Parser.parseReferenceType(Parser.java:2569)
at org.codehaus.janino.Parser.parseType(Parser.java:2549)
at org.codehaus.janino.Parser.parseFormalParameter(Parser.java:1688)
at org.codehaus.janino.Parser.parseFormalParameterList(Parser.java:1639)
at org.codehaus.janino.Parser.parseFormalParameters(Parser.java:1620)
at org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:1518)
at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:1028)
at org.codehaus.janino.Parser.parseClassBody(Parser.java:841)
at org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:736)
at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:941)
at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:234)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:205)
at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1403)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1500)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1497)
at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1351)
at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:721)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:338)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:336)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.writeWithV2(WriteToDataSourceV2Exec.scala:297)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.run(WriteToDataSourceV2Exec.scala:304)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:40)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:40)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:46)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696)
at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2965)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2965)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:589)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$15(MicroBatchExecution.scala:584)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:584)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:226)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:333)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244)
22/07/05 19:05:17 WARN WholeStageCodegenExec: Whole-stage codegen disabled for plan (id=1):
Any suggestions, please?

Related

Filter rows of snowflake table while reading in pyspark dataframe

I have a huge Snowflake table. I want to do some transformations on the table in PySpark. My Snowflake table has a column called 'snapshot'. I only want to read the current snapshot data into a PySpark DataFrame and do the transformations on the filtered data.
So, is there a way to filter the rows while reading the Snowflake table into a Spark DataFrame (I don't want to read the entire Snowflake table into memory because it is not efficient), or do I need to read the entire Snowflake table into a Spark DataFrame and then apply a filter to get the latest snapshot, something like below?
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
snowflake_database="********"
snowflake_schema="********"
source_table_name="********"
snowflake_options = {
    "sfUrl": "********",
    "sfUser": "********",
    "sfPassword": "********",
    "sfDatabase": snowflake_database,
    "sfSchema": snowflake_schema,
    "sfWarehouse": "COMPUTE_WH"
}
df = spark.read \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**snowflake_options) \
    .option("dbtable", snowflake_database+"."+snowflake_schema+"."+source_table_name) \
    .load()
df = df.where(df.snapshot == current_timestamp()).collect()
There are forms of filters (the filter or where functionality of a Spark DataFrame) that Spark does not pass to the Spark Snowflake connector. That means, in some situations, you may get more records than you expect.
The safest way would be to use a SQL query directly:
df = spark.read \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**snowflake_options) \
    .option("query", "SELECT X,Y,Z FROM TABLE1 WHERE SNAPSHOT = CURRENT_TIMESTAMP()") \
    .load()
Of course, if you want to use the filter/where functionality of the Spark DataFrame, check the Query History in the Snowflake UI to see whether the generated query has the right filter applied.
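For example, applied to the table from the question, a sketch of the query-option approach looks like this (snapshot_query and the SELECT * are mine; the other variables are the ones defined in the question):

# Sketch: push the snapshot filter down to Snowflake yourself via the "query" option,
# reusing the database/schema/table variables defined in the question.
snapshot_query = (
    f"SELECT * FROM {snowflake_database}.{snowflake_schema}.{source_table_name} "
    "WHERE SNAPSHOT = CURRENT_TIMESTAMP()"
)
df = spark.read \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**snowflake_options) \
    .option("query", snapshot_query) \
    .load()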

spark struct streaming writeStream output no data but no error

I have a Structured Streaming job which reads messages from a Kafka topic and then saves them to DBFS. The code is as follows:
input_stream = spark.readStream \
    .format("kafka") \
    .options(**kafka_options) \
    .load() \
    .transform(create_raw_features)
# transformation by a 7-day rolling window
def transform_func(df):
    window_spec = window("event_timestamp", "7 days", "1 day")
    return df \
        .withWatermark(eventTime="event_timestamp", delayThreshold="2 days") \
        .groupBy(window_spec.alias("window"), "customer_id") \
        .agg(count("*").alias("count")) \
        .select("window.end", "customer_id", "count")
result = input_stream.transform(transform_func)
query = result \
    .writeStream \
    .format("memory") \
    .queryName("test") \
    .option("truncate", "false").start()
I can see the checkpointing is working fine. However, there is no data output.
spark.table("test").show(truncate=False)
It shows an empty table. Any clue why?
I found the issue. In the Spark documentation's output modes section, it states:
Append mode uses watermark to drop old aggregation state. But the output of a windowed aggregation is delayed the late threshold specified in withWatermark() as by the modes semantics, rows can be added to the Result Table only once after they are finalized (i.e. after watermark is crossed).
Since I didn't specify the output mode explicitly, append is applied implicitly, which means the first output will occur only after the watermark threshold is passed.
To get the output per micro-batch, use output mode update or complete instead.
This works for me now:
query = result \
    .writeStream \
    .format("memory") \
    .outputMode("update") \
    .queryName("test") \
    .option("truncate", "false").start()

pyspark : Spark Streaming via socket

I am trying to read the data stream from the socket (ps.pndsn.com) and write it into temp_table for further processing, but the issue I am currently facing is that the temp_table I created as part of the writeStream is empty, even though the streaming is happening in real time. So I am looking for help in this regard.
Below is the code snippet:
# Create DataFrame representing the stream of input from the connection to ps.pndsn.com:9999
streamingDF = spark \
    .readStream \
    .format("socket") \
    .option("host", "ps.pndsn.com") \
    .option("port", 9999) \
    .load()
# Is this DF actually a streaming DF?
streamingDF.isStreaming
spark.conf.set("spark.sql.shuffle.partitions", "2") # keep the size of shuffles small
query = (
    streamingDF
    .writeStream
    .format("memory")
    .queryName("temp_table")  # temp_table = name of the in-memory table
    .outputMode("Append")     # Append = OutputMode in which only the new rows in the streaming DataFrame/Dataset will be written to the sink
    .start()
)
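For completeness, this is roughly how I check the in-memory table (a sketch; the exact check is not shown in the snippet above):

# Query the in-memory sink registered above under the name "temp_table"
spark.sql("SELECT * FROM temp_table").show(truncate=False)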
Streaming output:
{'channel': 'pubnub-sensor-network',
'message': {'ambient_temperature': '1.361',
'humidity': '81.1392',
'photosensor': '758.82',
'radiation_level': '200',
'sensor_uuid': 'probe-84d85b75',
'timestamp': 1581332619},
'publisher': None,
'subscription': None,
'timetoken': 15813326199534409,
'user_metadata': None}
The output of the temp_table is empty.

pyspark 2.4.x structured streaming foreachBatch not running

I am working with Spark 2.4.0 and Python 3.6. I am developing a Python program with PySpark Structured Streaming actions. The program runs two readStreams reading from two sockets, and afterwards makes a union of these two streaming DataFrames. I tried Spark 2.4.0 and 2.4.3 but nothing changed.
Then I perform a single writeStream in order to write just one output streaming DataFrame. THAT WORKS WELL.
However, since I also need to write a non-streaming dataset for all the micro-batches, I coded a foreachBatch call inside the writeStream. THAT DOESN'T WORK.
I put spark.scheduler.mode=FAIR in spark.defaults.conf. I am running through spark-submit, but even when I tried with python3 directly, it doesn't work at all. It looks like it never executes the splitStream function referenced in the foreachBatch. I tried adding some prints inside the splitStream function, without any effect.
I made many attempts, but nothing changed, whether I submitted via spark-submit or via python. I am working on a Spark standalone cluster.
inDF_1 = spark \
    .readStream \
    .format('socket') \
    .option('host', host_1) \
    .option('port', port_1) \
    .option("maxFilesPerTrigger", 1) \
    .load()
inDF_2 = spark \
    .readStream \
    .format('socket') \
    .option('host', host_2) \
    .option('port', port_2) \
    .option("maxFilesPerTrigger", 1) \
    .load() \
    .coalesce(1)
inDF = inDF_1.union(inDF_2)
#--------------------------------------------------#
# write streaming raw dataset R-01 plateMeasures #
#--------------------------------------------------#
def splitStream(df, epoch_id):
    df \
        .write \
        .format('text') \
        .outputMode('append') \
        .start(path = outDir0)
    listDF = df.collect()
    print(listDF)
    pass
stageDir = dLocation.getLocationDir('R-00')
outDir0 = dLocation.getLocationDir(outList[0])
chkDir = dLocation.getLocationDir('CK-00')
query0 = programName + '_q0'
q0 = inDF_1 \
    .writeStream \
    .foreachBatch(splitStream) \
    .format('text') \
    .outputMode('append') \
    .queryName(query0) \
    .start(path = stageDir,
           checkpointLocation = chkDir)
I am using foreachBatch because I need to write several sinks for each input microbatch.
Thanks a lot to everyone who tries to help me.
I have tried this on my local machine and it works for Spark > 2.4 (the example below is in Scala):
df.writeStream
  .foreachBatch((microBatchDF, microBatchId) => {
    microBatchDF
      .withColumnRenamed("value", "body")
      .write
      .format("console")
      .option("checkpointLocation", "checkPoint")
      .save()
  })
  .start()
  .awaitTermination()
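A rough PySpark equivalent of the same pattern (a sketch only; df stands for the streaming DataFrame, and the console sink and the value-to-body rename simply mirror the Scala snippet above):

# Sketch: PySpark version of the foreachBatch pattern shown above
def write_batch(micro_batch_df, micro_batch_id):
    micro_batch_df \
        .withColumnRenamed("value", "body") \
        .write \
        .format("console") \
        .save()

df.writeStream \
    .foreachBatch(write_batch) \
    .start() \
    .awaitTermination()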

Read data from bigquery with spark scala

I'm trying to read data from BigQuery and print it. Here is what I tried:
// Initialize Spark session
val spark = SparkSession
  .builder
  .master("local")
  .appName("Word Count")
  .config("fs.gs.project.id", "bigquery-public-data")
  .config("google.cloud.auth.service.account.enable", "true")
  .config("fs.gs.auth.service.account.json.keyfile", "<key_file>")
  .getOrCreate()
val macbeth = spark.sql("SELECT * FROM shakespeare WHERE corpus = 'macbeth'").persist()
macbeth.show(100)
But this gives me an error as follows:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: shakespeare; line 1 pos 14
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'shakespeare' not found in database 'default';
I couldn't find a way to fix this. Please help me to read data from this dataset.
Table or view not found: shakespeare; line 1 pos 14
When BigQuery looks for a table, it looks for it under the projectId and the dataset. In your code I see two possible issues:
projectId - You are using the BigQuery public project bigquery-public-data as your projectId; you need to change the value of this variable to a correct value.
datasetId - In your query you didn't indicate the dataset that stores the shakespeare table (see the sketch below).
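For illustration only, a sketch of a fully qualified table reference using the spark-bigquery connector (shown in PySpark for brevity; the connector and the public samples dataset are my assumptions, not part of the original post, and the connector must be on the classpath):

# Sketch: with the spark-bigquery connector, the table reference names the
# project, dataset and table explicitly, so no default database lookup is needed.
df = spark.read \
    .format("bigquery") \
    .option("table", "bigquery-public-data.samples.shakespeare") \
    .load()
df.where("corpus = 'macbeth'").show(100)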