I want to process data with Apache Beam from a Spark JavaRDD object that I am retrieving from sparkSession.sql(" query "), but I am not able to apply a PTransform to this Dataset directly.
I am using Apache Beam 2.14.0 (which upgraded the Spark runner to Spark 2.4.3, see BEAM-7265). Please guide me on this.
SparkSession session = SparkSession.builder().appName("test 2.0").master("local[*]").getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
final SparkContextOptions options = PipelineOptionsFactory.as(SparkContextOptions.class);
options.setRunner(SparkRunner.class);
options.setUsesProvidedSparkContext(true);
options.setProvidedSparkContext(jsc);
options.setEnableSparkMetricSinks(false);
Pipeline pipeline = Pipeline.create(options);
List<StructField> srcfields = new ArrayList<StructField>();
srcfields.add(DataTypes.createStructField("dataId", DataTypes.IntegerType, true));
srcfields.add(DataTypes.createStructField("code", DataTypes.StringType, true));
srcfields.add(DataTypes.createStructField("value", DataTypes.StringType, true));
srcfields.add(DataTypes.createStructField("dataFamilyId", DataTypes.IntegerType, true));
StructType dataschema = DataTypes.createStructType(srcfields);
List<Row> dataList = new ArrayList<Row>();
dataList.add(RowFactory.create(1, "AA", "Apple", 1));
dataList.add(RowFactory.create(2, "AB", "Orange", 1));
dataList.add(RowFactory.create(3, "AC", "Banana", 2));
dataList.add(RowFactory.create(4, "AD", "Guava", 3));
Dataset<Row> rawData = new SQLContext(jsc).createDataFrame(dataList, dataschema);//pipeline.getOptions().getRunner().cast();
JavaRDD<Row> javadata = rawData.toJavaRDD();
System.out.println("***************************************************");
for (Row line : javadata.collect()) {
    System.out.println(line.getInt(0) + "\t" + line.getString(1) + "\t" + line.getString(2) + "\t" + line.getInt(3));
}
System.out.println("***************************************************");
pipeline.apply(Create.of(javadata))
        .apply(ParDo.of(new DoFn<JavaRDD<Row>, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                JavaRDD<Row> row = c.element();
                c.output("------------------------------");
                System.out.println(".............................");
            }
        }))
.apply("WriteCounts", TextIO.write().to("E:\\output\\out"));
final PipelineResult result = pipeline.run();
System.out.println();
System.out.println("***********************************end");
I don't believe it's possible, since Beam is supposed to know nothing about Spark RDDs and the Beam Spark runner hides everything Spark-related under the hood. Potentially, you could create a custom Spark-specific PTransform that reads from an RDD and use it as an input to your pipeline for your specific case, but I'm not sure that's a good idea; perhaps it can be solved in another way. Could you share more details about your data processing pipeline?
There is no way to consume Spark Datasets or RDDs directly in Beam, but you should be able to ingest data from Hive into a Beam PCollection instead. See the docs for Beam's HCatalog IO connector: https://beam.apache.org/documentation/io/built-in/hcatalog/
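For reference, a minimal read sketch along the lines of the HCatalog IO documentation might look like the following; the metastore URI, database and table name are placeholders, and pipeline is the Beam Pipeline created in the question:
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.io.hcatalog.HCatalogIO;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hive.hcatalog.data.HCatRecord;

Map<String, String> configProperties = new HashMap<>();
// Placeholder metastore URI; point this at your Hive metastore.
configProperties.put("hive.metastore.uris", "thrift://metastore-host:9083");

PCollection<HCatRecord> records = pipeline.apply(
        HCatalogIO.read()
                .withConfigProperties(configProperties)
                .withDatabase("default")   // placeholder database
                .withTable("my_table"));   // placeholder table
The resulting PCollection of HCatRecord values can then be processed with ordinary Beam transforms instead of touching the RDD.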
I have a Glue streaming job, and I need to write the data as a stream but after applying some processing, so I did the following:
data_frame_DataSource0 = glueContext.create_data_frame.from_catalog(
    database=database_name,
    table_name=kinesis_table_name,
    transformation_ctx="DataSource0",
    additional_options={"inferSchema": "true", "startingPosition": starting_position_of_kinesis_iterator}
)

glueContext.forEachBatch(
    frame=data_frame_DataSource0,
    batch_function=processBatch,
    options={
        "windowSize": window_size,
        "checkpointLocation": s3_path_spark
    }
)
In processBatch I do some processing, and at the end of it I do the following:
df.writeStream.format("hudi").options(**combinedConf).outputMode('append').start()
I am getting the following error:
pyspark.sql.utils.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFram
As far as I understand, the df I am trying to write is not a streaming DataFrame, which is why it gives this error. I am not sure how I can change that from the Glue context, or how I can apply the processing on the streaming data and then call writeStream on it. Any ideas?
The forEachBatch method processes a streaming Dataset/DataFrame in batches, so when it is called on data_frame_DataSource0, the df passed to processBatch is a normal (non-streaming) Dataset/DataFrame containing one batch of data.
You have two options to fix this:
Deal with the df as a normal DataFrame:
df.write.format("hudi").options(**combinedConf).mode("append").save()
Apply your stream processing directly on data_frame_DataSource0:
data_frame_DataSource0 = glueContext.create_data_frame.from_catalog(
    database=database_name,
    table_name=kinesis_table_name,
    transformation_ctx="DataSource0",
    additional_options={"inferSchema": "true", "startingPosition": starting_position_of_kinesis_iterator}
)
(
data_frame_DataSource0.writeStream.format("hudi")
.options(**combinedConf)
.option("inferSchema", "true")
.option("startingPosition", starting_position_of_kinesis_iterator)
.outputMode('append').start()
)
I am trying to implement a solution in Spark Structured Streaming that refreshes a static dataset every 5 minutes and joins it with a streaming dataset that runs every 10 seconds.
I have tried to follow the solution described in Structured streaming with periodically updated static dataset.
This is my code:
Dataset<Row> kafkaDS = sparkSession.readStream().format("kafka")
.load()
.select(from_avro(col("value"), abrisConfig).as("event"))
.select("event.*");
AtomicReference<Dataset<Row>> cachedDS = new AtomicReference<>();
cachedDS.set(ss.read().format("jdbc")
.options(<options here>)
.load()
.repartition(<repartition columns>)
.sortWithinPartitions(<repartition columns>)
.persist());
VoidFunction2<Dataset<Long>, Long> refresh = new VoidFunction2<Dataset<Long>, Long>() {
    private static final long serialVersionUID = -6106370300512290330L;

    @Override
    public void call(Dataset<Long> v1, Long v2) throws Exception {
        System.out.println("Refreshing cache");
        cachedDS.get().unpersist();
        cachedDS.set(ss.read().format("jdbc")
                .options(<options here>)
                .load()
                .repartition(<repartition columns>)
                .sortWithinPartitions(<repartition columns>)
                .persist());
    }
};
Dataset<Long> staticRefreshStream = ss.readStream()
.format("rate")
.option("rowsPerSecond", 1)
.option("numPartitions", 1)
.load()
.selectExpr("CAST(value as LONG) as trigger")
.as(Encoders.LONG());
// Here we do the join
Dataset<TcrDenegadasFiltroValue> leftJoinDS = cachedDS.get().join(kafkaDS); // join details omitted
staticRefreshStream.writeStream()
.outputMode("append")
.foreachBatch(refresh)
.queryName("refreshCache")
.trigger(Trigger.ProcessingTime(5, TimeUnit.MINUTES))
.start();
StreamingQuery ds = leftJoinDS
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream()
.format("kafka")
.options(<kafka options>)
.start();
The code above runs with Spark 2.4.5, in a Kubernetes environment with 3 executor pods and a driver pod.
The problem is that when I start the application, my main stream (the one running every 10 seconds) does the join with the cached dataset, but it seems not to know that the static dataset is cached: it does not read it from the cache and takes a while to do the job.
Can you help with this?
I am continuously reading 500 MB of random tuples from a Kafka producer, and in a Storm topology I am inserting them into MongoDB using the Mongo Java driver. The problem is that I am getting a really low throughput of 4-5 tuples per second.
Without the DB insert, if I write a simple print statement instead, I get a throughput of 684 tuples per second. I am planning to run 1 million records from Kafka and check the throughput with the Mongo insert.
I tried to tune this using the setMaxSpoutPending and setMessageTimeoutSecs parameters in the Kafka config.
final SpoutConfig kafkaConf = new SpoutConfig(zkrHosts, kafkaTopic, zkRoot, clientId);
kafkaConf.ignoreZkOffsets=false;
kafkaConf.useStartOffsetTimeIfOffsetOutOfRange=true;
kafkaConf.startOffsetTime=kafka.api.OffsetRequest.LatestTime();
kafkaConf.stateUpdateIntervalMs=2000;
kafkaConf.scheme = new SchemeAsMultiScheme(new StringScheme());
final TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("kafka-spout", new KafkaSpout(kafkaConf), 1);
topologyBuilder.setBolt("print-messages", new MyKafkaBolt()).shuffleGrouping("kafka-spout");
Config conf = new Config();
conf.setDebug(true);
conf.setMaxSpoutPending(1000);
conf.setMessageTimeoutSecs(30);
Execute method of the bolt:
@Override
public void execute(Tuple input) {
    JSONObject jObj = new JSONObject();
    jObj.put("key", input.getString(0));
    if (null != jObj && jObj.size() > 0) {
        final DBCollection quoteCollection = dbConnect.getConnection().getCollection("stormPoc");
        if (quoteCollection != null) {
            BasicDBObject dbObject = new BasicDBObject();
            dbObject.putAll(jObj);
            quoteCollection.insert(dbObject);
            // logger.info("inserted in Collection !!!");
        } else {
            logger.info("Error while inserting data in DB!!!");
        }
        collector.ack(input);
    }
}
There is a storm-mongodb module for integration with Mongo. Does it not do the job? https://github.com/apache/storm/tree/b07413670fa62fec077c92cb78fc711c3bda820c/external/storm-mongodb
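For illustration, a rough sketch of wiring in storm-mongodb's MongoInsertBolt could look like the following; the connection URL and collection name are placeholders rather than values from the question:
import org.apache.storm.mongodb.bolt.MongoInsertBolt;
import org.apache.storm.mongodb.common.mapper.MongoMapper;
import org.apache.storm.mongodb.common.mapper.SimpleMongoMapper;

// Placeholder connection string and collection name.
String url = "mongodb://127.0.0.1:27017/stormPocDb";
String collectionName = "stormPoc";

// Maps tuple fields to a MongoDB document; the StringScheme used by the
// question's spout emits a single field named "str".
MongoMapper mapper = new SimpleMongoMapper().withFields("str");

MongoInsertBolt insertBolt = new MongoInsertBolt(url, collectionName, mapper);
topologyBuilder.setBolt("mongo-insert", insertBolt, 1).shuffleGrouping("kafka-spout");
That way the insert is handled by the bolt shipped with Storm instead of hand-rolled driver code in execute.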
You shouldn't use storm-kafka for Kafka integration, it is deprecated. Use storm-kafka-client instead.
Setting conf.setDebug(true) will impact your processing, as Storm will log a fairly huge amount of text per tuple.
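As a hedged sketch of both points (bootstrap servers, topic and group id are placeholders), the storm-kafka-client spout setup with debug logging turned off could look roughly like this:
import org.apache.storm.Config;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

// Placeholder bootstrap servers and topic name.
KafkaSpoutConfig<String, String> spoutConfig = KafkaSpoutConfig
        .builder("localhost:9092", "storm-poc-topic")
        .setProp("group.id", "storm-poc-group")
        .build();

TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 1);

Config conf = new Config();
conf.setDebug(false);          // avoid per-tuple debug logging
conf.setMaxSpoutPending(1000); // keep the back-pressure setting from the question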
I don't know how to pass SparkSession parameters programmatically when submitting a Spark job to Apache Livy.
This is the Test Spark job:
class Test extends Job[Int]{
override def call(jc: JobContext): Int = {
val spark = jc.sparkSession()
// ...
}
}
This is how this Spark job is submitted to Livy:
val client = new LivyClientBuilder()
.setURI(new URI(livyUrl))
.build()
try {
client.uploadJar(new File(testJarPath)).get()
client.submit(new Test())
} finally {
client.stop(true)
}
How can I pass the following configuration parameters to SparkSession?
.config("es.nodes","1localhost")
.config("es.port",9200)
.config("es.nodes.wan.only","true")
.config("es.index.auto.create","true")
You can do that easily through the LivyClientBuilder like this:
val client = new LivyClientBuilder()
.setURI(new URI(livyUrl))
.setConf("es.nodes","1localhost")
.setConf("key", "value")
.build()
Configuration parameters can be set on the LivyClientBuilder using
public LivyClientBuilder setConf(String key, String value)
so that your code starts with:
val client = new LivyClientBuilder()
.setURI(new URI(livyUrl))
.setConf("es.nodes","1localhost")
.setConf("es.port",9200)
.setConf("es.nodes.wan.only","true")
.setConf("es.index.auto.create","true")
.build()
LivyClientBuilder.setConf will not work, I think, because Livy modifies all configs that do not start with spark., and Spark cannot read the modified config. See here:
private static File writeConfToFile(RSCConf conf) throws IOException {
    Properties confView = new Properties();
    for (Map.Entry<String, String> e : conf) {
        String key = e.getKey();
        if (!key.startsWith(RSCConf.SPARK_CONF_PREFIX)) {
            key = RSCConf.LIVY_SPARK_PREFIX + key;
        }
        confView.setProperty(key, e.getValue());
    }
    ...
}
So the answer is quite simple: add the spark. prefix to all the es configs, like this:
.config("spark.es.nodes","1localhost")
.config("spark.es.port",9200)
.config("spark.es.nodes.wan.only","true")
.config("spark.es.index.auto.create","true")
I don't know whether it is elasticsearch-spark or Spark itself that handles the compatibility; it just works.
PS: I've tried this with the REST API, and it works, but not with the programmatic API.
I am using Apache Spark to consume real-time data from Apache Kafka, coming from sensors in JSON format.
Example of the data format:
{
"meterId" : "M1",
"meterReading" : "100"
}
I want to apply rules to raise alerts in real time, i.e. if I did not get data for meter "M1" in the last 2 hours, or a meter reading exceeds some limit, an alert should be created.
So how can I achieve this in Scala?
I will respond here as an answer - it is too long for a comment.
As I said, the JSON in Kafka should be one message per line - send this instead -> {"meterId":"M1","meterReading":"100"}
If you are using Kafka, there is KafkaUtils, with which you can create a stream:
JavaPairDStream<String, String> input = KafkaUtils.createStream(jssc, zkQuorum, group, topics);
The pair means <kafkaTopicName, jsonMessage>, so basically you can look only at the JSON message if you don't need the Kafka topic name.
For the input you can use the many methods described in the JavaPairDStream documentation - e.g. you can use map to extract just the messages into a simple JavaDStream.
And of course you can use a JSON parser like Gson, Jackson or org.json; it depends on your use case, performance requirements and so on.
So you need to do something like this:
JavaDStream<String> messagesOnly = input.map(
    new Function<Tuple2<String, String>, String>() {
        public String call(Tuple2<String, String> message) {
            return message._2();
        }
    }
);
Now you have only the messages, without the Kafka topic name, and you can apply the logic you described in the question.
JavaDStream<String> alerts = messagesOnly.filter(
    new Function<String, Boolean>() {
        public Boolean call(String message) {
            // here use a gson parser, e.g.
            // filter out messages whose meterReading does not exceed the limit
            // return true or false based on your logic
        }
    }
);
And here you have only the alert messages - you can send them on to another place.
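As a rough sketch (not part of the original answer), the body of that filter could parse each message with Gson; the field name matches the sample JSON above and the limit is just a placeholder:
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

// Hypothetical helper for the filter above: returns true when the
// meterReading in the JSON message exceeds a placeholder limit.
public static Boolean exceedsLimit(String json, int limit) {
    JsonObject obj = new JsonParser().parse(json).getAsJsonObject();
    int reading = obj.get("meterReading").getAsInt(); // "100" is parsed as 100
    return reading > limit;
}
Inside call you would then simply return exceedsLimit(message, yourLimit).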
-- AFTER EDIT
Below is the example in Scala:
// batch every 2 seconds
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
def filterLogic(message: String): Boolean = {
  // here your logic for filtering
}
// map _._2 takes your json messages
val messages = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
// filtered data after filter transformation
val filtered = messages.filter(m => filterLogic(m))