pyspark.sql.utils.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame - pyspark

I have a Glue streaming job, and I need to write the data as a stream, but only after applying some processing, so I did the following:
data_frame_DataSource0 = glueContext.create_data_frame.from_catalog(
    database=database_name,
    table_name=kinesis_table_name,
    transformation_ctx="DataSource0",
    additional_options={"inferSchema": "true", "startingPosition": starting_position_of_kinesis_iterator}
)
glueContext.forEachBatch(
    frame=data_frame_DataSource0,
    batch_function=processBatch,
    options={
        "windowSize": window_size,
        "checkpointLocation": s3_path_spark
    }
)
In processBatch I do some processing, and at the end of it I do the following:
df.writeStream.format("hudi").options(**combinedConf).outputMode('append').start()
I am getting the following error:
pyspark.sql.utils.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame
As far as I understand, the df I am trying to write is not a streaming DataFrame, which is why it's giving this error. I am not sure how I can change that from the Glue context, or how I can apply the processing to the streaming data and then call writeStream on it.
Any ideas, please?

The forEachBatch method processes a streaming Dataset/DataFrame as a series of batches, so when it is called on data_frame_DataSource0, the df passed to processBatch is a normal (non-streaming) Dataset/DataFrame containing one batch of data.
You have two options to fix this:
Deal with df as a normal DataFrame (a sketch of what this looks like inside processBatch is shown after the two options):
df.write.format("hudi").options(**combinedConf).mode("append").save()
Apply your stream processing directly on data_frame_DataSource0:
data_frame_DataSource0 = glueContext.create_data_frame.from_catalog(
    database=database_name,
    table_name=kinesis_table_name,
    transformation_ctx="DataSource0",
    additional_options={"inferSchema": "true", "startingPosition": starting_position_of_kinesis_iterator}
)
(
    data_frame_DataSource0.writeStream.format("hudi")
    .options(**combinedConf)
    .option("inferSchema", "true")
    .option("startingPosition", starting_position_of_kinesis_iterator)
    .outputMode('append')
    .start()
)
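Going back to option 1, a minimal sketch of what processBatch could look like with a batch write (only a sketch: the (data_frame, batchId) signature follows the usual Glue examples, and combinedConf and the actual per-batch transformations are assumed to be defined elsewhere):
def processBatch(data_frame, batchId):
    # data_frame is a regular (non-streaming) DataFrame holding one micro-batch
    if data_frame.count() > 0:
        # ... apply your per-batch transformations here to produce df ...
        df = data_frame
        # batch write: use write/save, not writeStream/start
        df.write.format("hudi").options(**combinedConf).mode("append").save()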

Related

How to process spark JavaRDD data with apache beam?

I want to process data from a Spark JavaRDD object, which I am retrieving via sparkSession.sql(" query "), with Apache Beam, but I am not able to apply a PTransform to this Dataset directly.
I am using Apache Beam 2.14.0 (whose Spark runner was upgraded to Spark 2.4.3, BEAM-7265). Please guide me on this.
SparkSession session = SparkSession.builder().appName("test 2.0").master("local[*]").getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
final SparkContextOptions options = PipelineOptionsFactory.as(SparkContextOptions.class);
options.setRunner(SparkRunner.class);
options.setUsesProvidedSparkContext(true);
options.setProvidedSparkContext(jsc);
options.setEnableSparkMetricSinks(false);
Pipeline pipeline = Pipeline.create(options);
List<StructField> srcfields = new ArrayList<StructField>();
srcfields.add(DataTypes.createStructField("dataId", DataTypes.IntegerType, true));
srcfields.add(DataTypes.createStructField("code", DataTypes.StringType, true));
srcfields.add(DataTypes.createStructField("value", DataTypes.StringType, true));
srcfields.add(DataTypes.createStructField("dataFamilyId", DataTypes.IntegerType, true));
StructType dataschema = DataTypes.createStructType(srcfields);
List<Row> dataList = new ArrayList<Row>();
dataList.add(RowFactory.create(1, "AA", "Apple", 1));
dataList.add(RowFactory.create(2, "AB", "Orange", 1));
dataList.add(RowFactory.create(3, "AC", "Banana", 2));
dataList.add(RowFactory.create(4, "AD", "Guava", 3));
Dataset<Row> rawData = new SQLContext(jsc).createDataFrame(dataList, dataschema);//pipeline.getOptions().getRunner().cast();
JavaRDD<Row> javadata = rawData.toJavaRDD();
System.out.println("***************************************************");
for (Row line : javadata.collect()) {
    System.out.println(line.getInt(0) + "\t" + line.getString(1) + "\t" + line.getString(2) + "\t" + line.getInt(3));
}
System.out.println("***************************************************");
pipeline.apply(Create.of(javadata))
    .apply(ParDo.of(new DoFn<JavaRDD<Row>, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            JavaRDD<Row> row = c.element();
            c.output("------------------------------");
            System.out.println(".............................");
        }
    }))
    .apply("WriteCounts", TextIO.write().to("E:\\output\\out"));
final PipelineResult result = pipeline.run();
System.out.println();
System.out.println("***********************************end");
I don't believe it's possible, since Beam is supposed to know nothing about Spark RDDs, and the Beam Spark runner hides all Spark-related things under the hood. Potentially, you could create a custom Spark-specific PTransform that reads from the RDD and use it as the input of your pipeline for your specific case, but I'm not sure it's a good idea; perhaps it can be solved in another way. Could you share more details about your data processing pipeline?
There is no way to directly consume Spark Datasets or RDDs into Beam, but you should be able to ingest data from Hive into a Beam PCollection instead. See the docs for Beam's HCatalog IO connector: https://beam.apache.org/documentation/io/built-in/hcatalog/

How to access the data from streaming query in "memory" table for subsequent batch queries?

Given a writeStream call:
val outDf = (sdf.writeStream
  .outputMode(outputMode)
  .format("memory")
  .queryName("MyInMemoryTable")
  .trigger(Trigger.ProcessingTime(interval))
  .start())
How can I run SQL against MyInMemoryTable, e.g.
val df = spark.sql("""select Origin,Dest,Carrier,avg(DepDelay) avgDepDelay
from MyInMemoryTable group by 1,2,3""")
The documentation for Spark Structured Streaming says that batch and streaming queries can be intermixed, but the above is not working:
org.apache.spark.sql.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame;
So how can the InMemoryTable be used in subsequent queries?
The following post on the Hortonworks site has an approach that seems promising: https://community.hortonworks.com/questions/181979/spark-structured-streaming-formatmemory-is-showing.html
Here is the sample writeStream, which is of the same form as in my original question:
StreamingQuery initDF = df.writeStream()
    .outputMode("append")
    .format("memory")
    .queryName("initDF")
    .trigger(Trigger.ProcessingTime(1000))
    .start();
sparkSession.sql("select * from initDF").show();
initDF.awaitTermination();
And here is the response:
Okay, the way it works is:
In simple terms, think of the main thread of your code as launching another thread in which your streaming query logic runs.
Meanwhile, your main code is blocked by initDF.awaitTermination().
sparkSession.sql("select * from initDF").show() runs on the main thread, and it reaches that line only once.
So update your code to:
StreamingQuery initDF = df.writeStream()
    .outputMode("append")
    .format("memory")
    .queryName("initDF")
    .trigger(Trigger.ProcessingTime(1000))
    .start();

while (initDF.isActive()) {
    Thread.sleep(10000);
    sparkSession.sql("select * from initDF").show();
}
Now the main thread of your code will go through the loop over and over again, querying the table each time.
Applying the suggestions to my code results in:
while (outDf.isActive) {
  Thread.sleep(30000)
  strmSql(s"select * from $table", doCnt = false, show = true, nRows = 200)
}
outDf.awaitTermination(1 * 20000)
Update: this worked great. I am seeing updated results after each mini-batch.
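For anyone doing the same thing from PySpark, the equivalent polling pattern might look roughly like this (a minimal sketch, assuming sdf is a streaming DataFrame and spark is the active SparkSession):

import time

query = (sdf.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("MyInMemoryTable")
    .start())

# Poll the in-memory table from the main thread while the stream is active
while query.isActive:
    time.sleep(10)
    spark.sql("select * from MyInMemoryTable").show()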

Spark Streaming and Kafka: Best way to read file from HDFS

Scenario
We expect to receive CSV files (approx. 10 MB each), which are stored in HDFS. The producing process then sends a message to a Kafka topic (the message contains file metadata such as the HDFS location).
A Spark Streaming job listens to this Kafka topic; on receiving a message, it should read the file from HDFS and process it.
What is the most efficient way to read the file from HDFS in the above scenario?
read from Kafka
JavaInputDStream<ConsumerRecord<String, FileMetaData>> messages = KafkaUtils.createDirectStream(...);
JavaDStream<FileMetaData> files = messages.map(record -> record.value());
Option 1 - use a flatMap function
JavaDStream<String> allRecords = files.flatMap(file -> {
    ArrayList<String> records = new ArrayList<>();
    Path inFile = new Path(file.getHDFSLocation());
    // code to read file from HDFS
    return records;
});
Option 2 - use foreachRDD
ArrayList<String> records = new ArrayList<>();
files.foreachRDD(rdd -> {
    rdd.foreachPartition(part -> {
        while (part.hasNext()) {
            Path inFile = new Path(part.next().getHDFSLocation());
            // code to read file from HDFS
            records.add(...);
        }
    });
});
JavaRDD<String> rddRecords = javaSparkContext.parallelize(records);
Which option is better?
Also, should I be using the Spark context's built-in methods to read the file from HDFS instead of using the HDFS Path API?
Thanks

How to pull data from S3 using Spark

I have a bunch of CSV files containing time- and space-dependent data in an AWS S3 bucket. The files are prefixed with timestamps at 5-minute granularity.
When trying to access them from AWS EMR with Apache Spark and filter them by both time and space, even beefy clusters (5 x r3.8xlarge) are crashing. I'm trying to do the filtering with a broadcast join.
Location is a class with a user id, a timestamp and mobile cell information, which I'm trying to join with the cell position information (segmentDF) to keep only the records that are required.
I need to process these records further; here I just try to save them as Parquet. I feel there must be a more efficient way of doing this, starting with how the data is stored in the S3 bucket. Any ideas are appreciated.
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 suggests an alternative, faster way of accessing S3 buckets from Spark, which I could not get to work (see below for the code and the error report).
// Scala code for the filtering
val locationDF = sc.textFile(s"bucket/location_files/201703*")
  .map(line => {
    val l = new Location(line)
    (l.id, l.time, l.cell)
  })
  .toDF("id", "time", "cell")

val df = locationDF.join(broadcast(segmentDF), Seq("cell"), "inner")
  .select($"id", $"time", $"lat", $"lng", $"cellName")
  .repartition(32)

df.write.save("somewhere/201703.parquet")
// Alternative way of accessing S3 keys
import com.amazonaws.services.s3._, model._
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
val credentials = new DefaultAWSCredentialsProviderChain().getCredentials
val request = new ListObjectsRequest()
request.setBucketName("s3-eu-west-1.amazonaws.com/bucket")
request.setPrefix("location_files")
request.setMaxKeys(32000)
def s3 = new AmazonS3Client(new BasicAWSCredentials(credentials.getAWSAccessKeyId, credentials.getAWSSecretKey))
val objs = s3.listObjects(request)
sc.parallelize(objs.getObjectSummaries.map(_.getKey).toList)
.flatMap { key => Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines }
The latter ends up with the error: com.amazonaws.services.s3.model.AmazonS3Exception: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint. (Service: Amazon S3; Status Code: 301; Error Code: PermanentRedirect; Request ID: DAE08BA90C01EB5E)
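As a side note on that error, the 301 PermanentRedirect usually means the bucket name includes the endpoint host (as in "s3-eu-west-1.amazonaws.com/bucket"); the bucket name should be just the bucket itself, with the region configured on the client. A minimal sketch of listing keys that way, here in Python with boto3 and hypothetical bucket/prefix names:

import boto3

# Region goes on the client, not into the bucket name
s3 = boto3.client("s3", region_name="eu-west-1")

keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="bucket", Prefix="location_files/"):
    for obj in page.get("Contents", []):
        keys.append(obj["Key"])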

Why memory sink writes nothing in append mode?

I used Spark Structured Streaming to stream messages from Kafka. The data was then aggregated and written to a memory sink in append mode. However, when I tried to query the memory table, it returned nothing. Below is the code:
result = model
    .withColumn("timeStamp", col("startTimeStamp").cast("timestamp"))
    .withWatermark("timeStamp", "5 minutes")
    .groupBy(window(col("timeStamp"), "5 minutes").alias("window"))
    .agg(count("*").alias("total"));

// writing to memory
StreamingQuery query = result.writeStream()
    .outputMode(OutputMode.Append())
    .queryName("datatable")
    .format("memory")
    .start();
// query data in memory
new Timer().scheduleAtFixedRate(new TimerTask() {
    @Override
    public void run() {
        sparkSession.sql("SELECT * FROM datatable").show();
    }
}, 10000, 10000);
The result is always:
+------+-----+
|window|total|
+------+-----+
+------+-----+
If I use outputMode = complete, then I can get the aggregated data, but that's not an option since the requirement is to use append mode.
Is there any problem with the code?
Thanks.
In append mode, the output of a windowed aggregation is delayed by the late threshold specified in withWatermark(): a window is emitted only once the watermark (which advances as new data arrives) has moved past the end of that window.
In your case the delay is 5 minutes. I know nothing about your input data, but I would guess you simply need to wait at least 5 minutes (and for later data to arrive) before rows show up.
I suggest you read (again?) the Structured Streaming docs on watermarking and output modes.
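To make that concrete, here is a minimal PySpark sketch of the same pattern (assuming a streaming DataFrame named events with a startTimeStamp column); with a 5-minute watermark, a given 5-minute window only appears in the datatable memory table after later events push the watermark past that window's end:

from pyspark.sql.functions import col, count, window

result = (events
    .withColumn("timeStamp", col("startTimeStamp").cast("timestamp"))
    .withWatermark("timeStamp", "5 minutes")
    .groupBy(window(col("timeStamp"), "5 minutes").alias("window"))
    .agg(count("*").alias("total")))

# In append mode a window is emitted only once the watermark passes its end,
# so rows appear in the memory table with (at least) the watermark delay.
query = (result.writeStream
    .outputMode("append")
    .queryName("datatable")
    .format("memory")
    .start())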