Spark Streaming and Kafka: Best way to read file from HDFS

Scenario
We expect to receive CSV files (approximately 10 MB each), which are stored in HDFS. The producing process then sends a message to a Kafka topic (the message contains file metadata such as the HDFS location).
A Spark Streaming job listens to this Kafka topic; on receiving a message it should read the corresponding file from HDFS and process it.
What is the most efficient way to read the file from HDFS in this scenario?
Read from Kafka
JavaInputDStream<ConsumerRecord<String, FileMetaData>> messages = KafkaUtils.createDirectStream(...);
JavaDStream<FileMetaData> files = messages.map(record -> record.value());
Option 1 - use a flatMap function
JavaDStream<String> allRecords = files.flatMap(file -> {
    ArrayList<String> records = new ArrayList<>();
    Path inFile = new Path(file.getHDFSLocation());
    // code to read the file from HDFS into records
    return records;
});
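For reference, the elided "read the file from HDFS" step inside the flatMap could look roughly like the sketch below, using the Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem, org.apache.hadoop.conf.Configuration). This is only an illustration, not part of the original question; error handling is omitted.
// Inside the flatMap lambda: open the file referenced by the metadata and collect its lines.
FileSystem fs = FileSystem.get(new Configuration());
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(inFile), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        records.add(line);
    }
}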
Option 2 - use foreachRDD
ArrayList<String> records = new ArrayList<>();
files.foreachRDD(rdd -> {
    rdd.foreachPartition(part -> {
        while (part.hasNext()) {
            Path inFile = new Path(part.next().getHDFSLocation());
            // code to read the file from HDFS
            records.add(...);
        }
    });
});
JavaRDD<String> rddRecords = javaSparkContext.parallelize(records);
Which option is better?
Also, should I be using the Spark context's built-in methods to read the file from HDFS instead of using the HDFS Path API?
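If the built-in route were used instead, a driver-side variant inside foreachRDD might look roughly like this (a sketch, not from the original post; it assumes the per-batch list of file paths is small enough to collect to the driver, and relies on textFile accepting a comma-separated list of paths):
files.foreachRDD(rdd -> {
    List<String> paths = rdd.map(FileMetaData::getHDFSLocation).collect();
    if (!paths.isEmpty()) {
        // Let Spark read and partition the files itself.
        JavaRDD<String> lines = javaSparkContext.textFile(String.join(",", paths));
        // process 'lines' here
    }
});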
Thanks

Related

Spring Batch: InputStream is getting closed in the Spring Batch reader if the writer takes more than 5 minutes to process

What needs to be achieved: read a CSV from an SFTP location, write it again to a different path, and also save it to a DB with Spring Batch in a Spring Boot app.
Issue: the reader is executed only once, while the writer runs once per chunk; a print statement in the reader is printed only once, whereas the writer prints on every chunk execution. This seems to be the default behaviour of FlatFileItemReader.
I am using an SFTP channel to read the file from the SFTP location, and it gets closed after the read if the writer's processing time is long.
So is there a way I can pass a new SFTP connection for each chunk, or a way to extend the reader's input stream timeout? I don't see any timeout option on the reader. In the SFTP configuration I already tried increasing the timeout and idle time, but to no avail.
I have tried creating a new SFTP connection in the reader and passing it to the stream, but as the reader is only initialized once, this does not help.
Reader snippet:
private Step step(FileInputDTO input, Map<String, Float> ratelist) throws SftpException {
    return stepBuilderFactory.get("Step").<DTO, DTO>chunk(chunkSize)
            .reader(buildReader(input))
            .writer(new Writer(input, fileUtil, ratelist, mapper, service))
            .taskExecutor(taskExecutor)
            .listener(stepListener)
            .build();
}

private FlatFileItemReader<? extends DTO> buildReader(FileInputDTO input) throws SftpException {
    // Create reader instance
    FlatFileItemReader<DTO> reader = new FlatFileItemReader<>();
    log.info("reading file : starts");
    // Set input file location
    reader.setResource(new InputStreamResource(input.getChannel().get(input.getPath())));
    // Set number of lines to skip. Use it if the file has header rows.
    reader.setLinesToSkip(1);
    // Other code
    return reader;
}
SFTP configuration:
public SFTPUtil(Environment env, String sftpPassword) throws JSchException {
    JSch jsch = new JSch();
    log.debug("Creating SFTP channelSftp");
    Session session = jsch.getSession(env.getProperty("sftp.remoteUserName"),
            env.getProperty("sftp.remoteHost"), Integer.parseInt(env.getProperty("sftp.remotePort")));
    session.setConfig(Constants.STRICT_HOST_KEY_CHECKING, Constants.NO);
    session.setPassword(sftpPassword);
    session.connect(Integer.parseInt(env.getProperty("sftp.sessionTimeout")));
    Channel sftpChannel = session.openChannel(Constants.SFTP);
    sftpChannel.connect(Integer.parseInt(env.getProperty("sftp.channelTimeout")));
    this.channel = (ChannelSftp) sftpChannel;
    log.debug("SFTP channelSftp connected");
}

public ChannelSftp get() throws CustomException {
    if (channel == null) throw new CustomException("Channel creation failed");
    return channel;
}
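One way to avoid holding the SFTP stream open for the whole chunked step is to copy the remote file to local storage before building the reader and point the FlatFileItemReader at the local copy, so the channel can be closed right after the download. This is only a sketch (the downloadToTemp helper is hypothetical, not part of the original code) reusing the same input.getChannel().get(...) call as above:
// Hypothetical helper: copy the remote file to a local temp file so the SFTP
// channel does not have to stay open while the writer processes chunks.
private Resource downloadToTemp(FileInputDTO input) throws SftpException, IOException {
    java.nio.file.Path tempFile = Files.createTempFile("sftp-input-", ".csv");
    try (InputStream in = input.getChannel().get(input.getPath())) {
        Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
    }
    return new FileSystemResource(tempFile.toFile());
}
// In buildReader(...), use reader.setResource(downloadToTemp(input)); instead of the InputStreamResource.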

Flink pipeline not processing Kafka messages after switching source

I have a use case for building a stateful application. I need to build up state from historical data stored in S3 and then switch to Kafka (from a particular offset) while carrying the historical state forward.
We have a Beam pipeline running on the Flink runner. Below are the steps I am performing:
1. Process all the files and build state.
2. Stop the Flink job with a savepoint.
3. Start the Flink job from the savepoint taken in step 2 and switch to Kafka as the source.
My pipeline is not processing any messages in step 3. When I check the Flink UI I observe that the watermark is set to MAX_WATERMARK (9223372036854776000) for the stateful transformation block. I am looking for a way to override this watermark and set it to the required value.
Below are the sample code and topology. For the POC I am reading data from local files, and I get the file names from Kafka.
I am using
Flink version 1.9.3
Beam version 2.23.0
try {
    PCollection<String> records = null;
    boolean isSource1 = runningMode.equals("source1");
    if (isSource1) {
        PCollection<String> absolutePaths = pipeline
                .apply("read from source topic 1", KafkaIO.<Long, String>read()
                        .withBootstrapServers("localhost:9092")
                        .withTopic("file-topic")
                        .withKeyDeserializer(org.apache.kafka.common.serialization.LongDeserializer.class)
                        .withValueDeserializer(org.apache.kafka.common.serialization.StringDeserializer.class))
                .apply(MapElements.into(TypeDescriptor.of(String.class)).via(record -> {
                    String folder = record.getKV().getValue();
                    String path = "file:///tmp/files/" + folder + "/*";
                    System.out.println("path -> " + path);
                    return path;
                }));
        records = absolutePaths
                .apply("read file names", FileIO.matchAll())
                .apply("match file names", FileIO.readMatches())
                .apply("read data from files", TextIO.readFiles());
    } else {
        records = pipeline
                .apply("read from source topic 2", KafkaIO.<Long, String>read()
                        .withBootstrapServers("localhost:9092")
                        .withTopic("data-topic")
                        .withStartReadTime(Instant.parse("2021-09-01T09:30:00-00:00"))
                        .withKeyDeserializer(org.apache.kafka.common.serialization.LongDeserializer.class)
                        .withValueDeserializer(org.apache.kafka.common.serialization.StringDeserializer.class))
                .apply(MapElements.into(TypeDescriptor.of(String.class)).via(record -> record.getKV().getValue()));
    }
    PCollection<String> output = records.apply("counter", new ClientTransformation());
    return output.apply("writing to output topic", KafkaIO.<Void, String>write()
            .withBootstrapServers("localhost:9092")
            .withTopic("output-topic")
            .withValueSerializer(org.apache.kafka.common.serialization.StringSerializer.class)
            .values());
} catch (Exception e) {
    LOG.error("Failed to initialize pipeline due to missing coders", e);
    return null;
}
[image: topology in step 1]
[image: topology in step 3]

Spark Structured Streaming: Running Kafka consumer on separate worker thread

So I have a Spark application that needs to read two streams from two Kafka clusters (Kafka A and Kafka B) using Structured Streaming, and do some joins and filtering on the two streams. Is it possible to have a Spark job that reads the stream from A, and also runs a thread (called the consumer) on each worker that reads Kafka B and puts the data into a map? Later, when filtering, we could do something like stream.filter(row => consumer.idNotInMap(row.id)).
I have some questions regarding this approach:
If this approach works, will it cause any problems when the application is run on a cluster?
Will all consumer instances on each worker receive the same data in cluster mode? Or can we even let each consumer listen only to the Kafka partitions for that worker node (which is probably controlled by Spark)?
How will the consumer instance get serialized and passed to the workers?
Currently it is initialized on the driver node, but is there a way to initialize it once per worker node? (See the sketch after the pseudo-code below.)
I feel like in my case I should use stream joining instead. I've already tried that and it didn't work, which is why I am taking this approach. It didn't work because the stream from Kafka A is append-only while stream B needs state that can be updated, which makes it update-only, and joining streams in append and update mode is not supported in Spark.
Here is some pseudo-code:
// SparkJob.scala
val consumer = new Consumer()
val getMetadata = udf((id: Int) => consumer.get(id))
val enrichedDataSet = stream.withColumn("metadata", getMetadata(stream("id")))
// Consumer.java
class Consumer implements Serializable {
    private final ConcurrentHashMap<Integer, String> metadata;

    public Consumer() {
        metadata = new ConcurrentHashMap<>();
        // start the background thread that reads the Kafka B stream
        listen();
    }

    // poll Kafka B in a loop and keep the in-memory map up to date
    private void listen() {
        Thread t = new Thread(() -> {
            KafkaConsumer<Integer, String> consumer = ...; // subscribed to the Kafka B topic
            while (true) {
                for (ConsumerRecord<Integer, String> message : consumer.poll(Duration.ofMillis(500))) {
                    // update existing metadata or put in new metadata
                    metadata.put(message.key(), message.value());
                }
            }
        });
        t.setDaemon(true);
        t.start();
    }

    public String get(Integer key) {
        return metadata.get(key);
    }
}
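Regarding initializing the consumer once per worker: a common pattern is a lazily created per-JVM holder that the UDF calls into, so nothing has to be serialized from the driver. A sketch in Java (the ConsumerHolder class is hypothetical, not from the original post):
// Hypothetical per-executor holder: each executor JVM creates one Consumer the
// first time a task touches it, instead of receiving a serialized copy from the driver.
public final class ConsumerHolder {
    private static volatile Consumer instance;

    private ConsumerHolder() {}

    public static Consumer get() {
        if (instance == null) {
            synchronized (ConsumerHolder.class) {
                if (instance == null) {
                    instance = new Consumer();
                }
            }
        }
        return instance;
    }
}
The UDF would then call ConsumerHolder.get().get(id) instead of closing over a driver-side instance.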

Publish an AVRO file to a kafka topic

I have a file that contains data in Avro format which needs to be published directly to a Kafka topic. Are there any utilities available to do this without much data parsing in my code? I am using Kafka version 1.0.
You can read the data from the Avro file and then serialize it into a byte array.
final Schema avroSchema = new Schema.Parser().parse(new File("yourAvroSchema.avsc"));
File avroFile = new File("yourAvroFile.avro");
// Read as GenericRecord
final GenericDatumReader<GenericRecord> genericDatumReader = new GenericDatumReader<>(avroSchema);
final DataFileReader<GenericRecord> genericRecords = new DataFileReader<>(avroFile, genericDatumReader);
// Serialization
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(avroSchema);
Encoder binaryEncoder = EncoderFactory.get().binaryEncoder(out, null);
while (genericRecords.hasNext()) {
    writer.write(genericRecords.next(), binaryEncoder);
}
binaryEncoder.flush();
out.close();
// ....
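The elided part would then hand the serialized bytes to a producer. A minimal sketch using the plain Kafka producer with a ByteArraySerializer (the topic name and bootstrap servers are placeholders):
// Publish the serialized Avro payload (placeholder topic and bootstrap values).
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
    producer.send(new ProducerRecord<>("yourTopic", out.toByteArray()));
    producer.flush();
}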

Push data from a text file to Kafka continuously

I have created a simple producer that reads data from a text file and sends it to Kafka:
try (BufferedReader br = new BufferedReader(new FileReader(getInputFileName()))) {
    String line = br.readLine();
    while (line != null) {
        KeyedMessage<String, String> data = new KeyedMessage<String, String>(getTopic(), null, line);
        producer.send(data);
        System.out.println(line);
        //Thread.sleep(200L);
        line = br.readLine();
    }
}
It is working perfectly, but it only reads the data present in the file at that time and sends it, so if someone changes the text file and adds new lines, the new data will not be sent to Kafka.
I need to know if I can do something that will continuously capture the new lines inserted into the text file and send them automatically to Kafka.
Any help?
Since version 0.9, Kafka Connect has been part of Apache Kafka.
It also ships with FileStreamSource as one of the built-in source connectors.
For a detailed example check this link.
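For reference, a minimal standalone FileStreamSource configuration looks roughly like this (the file path and topic name are placeholders); Kafka Connect tails the file and publishes each newly appended line to the topic:
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/path/to/yourInputFile.txt
topic=your-topic
It can be run with the standalone worker, e.g. bin/connect-standalone.sh config/connect-standalone.properties with this file passed as the connector configuration.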