How to push a DataStream to a Kafka topic while retaining the order, using the string method, in Flink - apache-kafka

I am trying to create a JSON dataset every 500 ms and want to push it to a Kafka topic so that I can set up some windows downstream and perform computations. Below is my code:
package KafkaAsSource
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.datastream.DataStream
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.Semantic
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaProducer}
import org.apache.flink.streaming.connectors.kafka.internals.KeyedSerializationSchemaWrapper
import java.time.format.DateTimeFormatter
import java.time.LocalDateTime
import java.util.{Optional, Properties}
object PushingDataToKafka {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setMaxParallelism(256)
env.enableCheckpointing(5000)
val stream: DataStream[String] = env.fromElements(createData())
stream.addSink(sendToTopic(stream))
}
def getProperties(): Properties = {
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("zookeeper.connect", "localhost:2181")
return properties
}
def createData(): String = {
val minRange: Int = 0
val maxRange: Int = 1000
var jsonData = ""
for (a <- minRange to maxRange) {
jsonData = "{\n \"id\":\"" + a + "\",\n \"Category\":\"Flink\",\n \"eventTime\":\"" + DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS").format(LocalDateTime.now) + "\"\n \n}"
println(jsonData)
Thread.sleep(500)
}
return jsonData
}
def sendToTopic(): Properties = {
val producer = new FlinkKafkaProducer[String](
  "topic",
  new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema()),
  getProperties(),
  FlinkKafkaProducer.Semantic.EXACTLY_ONCE
)
return producer
}
}
It gives me the error below:
type mismatch;
found : Any
required: org.apache.flink.streaming.api.functions.sink.SinkFunction[String]
stream.addSink(sendToTopic())
Modified Code:
object FlinkTest {
def main(ars: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment()
env.setMaxParallelism(256)
var stream = env.fromElements("")
//env.enableCheckpointing(5000)
//val stream: DataStream[String] = env.fromElements("hey mc", "1")
val myProducer = new FlinkKafkaProducer[String](
"maddy", // target topic
new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema()), // serialization schema
getProperties(), // producer config
FlinkKafkaProducer.Semantic.EXACTLY_ONCE)
val minRange: Int = 0
val maxRange: Int = 10
var jsonData = ""
for (a <- minRange to maxRange) {
jsonData = "{\n \"id\":\"" + a + "\",\n \"Category\":\"Flink\",\n \"eventTime\":\"" + DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS").format(LocalDateTime.now) + "\"\n \n}"
println(a)
Thread.sleep(500)
stream = env.fromElements(jsonData)
println(jsonData)
stream.addSink(myProducer)
}
env.execute("hey")
}
def getProperties(): Properties = {
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("zookeeper.connect", "localhost:2181")
return properties
}
/*
def createData(): String = {
val minRange: Int = 0
val maxRange: Int = 10
var jsonData = ""
for (a <- minRange to maxRange) {
jsonData = "{\n \"id\":\"" + a + "\",\n \"Category\":\"Flink\",\n \"eventTime\":\"" + DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS").format(LocalDateTime.now) + "\"\n \n}"
Thread.sleep(500)
}
return jsonData
}
*/
}
The modified code gives me the data in the Kafka topic, but it doesn't retain the order. What am I doing wrong here in the loop? I also had to change the Flink version from 1.13.5 to 1.12.2.
I was initially using Flink 1.13.5 with the 2.11 connectors and Scala 2.11. What exactly am I missing here?

A couple of things about this loop:
for (a <- minRange to maxRange) {
jsonData =
"{\n \"id\":\"" + a + "\",\n \"Category\":\"Flink\",\n \"eventTime\":\""
+ DateTimeFormatter
.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
.format(LocalDateTime.now) + "\"\n \n}"
println(a)
Thread.sleep(500)
stream = env.fromElements(jsonData)
println(jsonData)
stream.addSink(myProducer)
}
The sleep is happening in the Flink client, and only affects how long it takes the client to assemble the job graph before submitting it to the cluster. It has no effect on how the job runs.
This loop is creating a separate pipeline for each of the 11 iterations (0 to 10); they all run independently, in parallel, and all produce to the same Kafka topic. Those pipelines are going to race against each other.
To get the behavior you're looking for (a global ordering across a single pipeline) you'll want to produce all of the events from a single source (in order, of course), and run the job with a parallelism of one. Something like this would do it:
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}

object FlinkTest {
  def main(ars: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment()
    env.setParallelism(1)

    val myProducer = ...
    val jsonData = (i: Long) => ...

    env.fromSequence(0, 9)
      .map(i => jsonData(i))
      .addSink(myProducer)

    env.execute()
  }
}
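For completeness, a minimal self-contained sketch of what the elided myProducer and jsonData could look like, reusing the topic name, serialization wrapper, and JSON layout from the question (everything else, including the broker address, is an assumption; the 500 ms pacing is deliberately omitted, since sleeping would have to happen inside a source function rather than in the client):

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer
import org.apache.flink.streaming.connectors.kafka.internals.KeyedSerializationSchemaWrapper
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.util.Properties

object FlinkTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // a single pipeline, so events keep the order they were produced in

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // assumption: local broker

    // Same producer as in the question; note that EXACTLY_ONCE also requires compatible
    // Kafka transaction timeout settings, which is not covered here.
    val myProducer = new FlinkKafkaProducer[String](
      "maddy",
      new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema()),
      props,
      FlinkKafkaProducer.Semantic.EXACTLY_ONCE)

    // Build one JSON string per sequence number, with the same fields as in the question.
    val jsonData = (i: Long) =>
      s"""{ "id":"$i", "Category":"Flink", "eventTime":"${
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS").format(LocalDateTime.now)
      }" }"""

    env.fromSequence(0, 9)
      .map(i => jsonData(i))
      .addSink(myProducer)

    env.execute("ordered-json-to-kafka")
  }
}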
You can leave maxParallelism at 256 (or at its default value of 128); it's not particularly relevant here. The maxParallelism is the number of hash buckets that keyBy will hash the keys into, and it defines an upper limit on the scalability of the job.
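Just to make those two knobs concrete (the values here are only illustrative):

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)      // how many parallel subtask instances actually run
env.setMaxParallelism(128) // number of key groups keyBy hashes into; the upper bound for later rescaling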

Related

Bug in Scala foreach loop while trying to loop through consumer records

Cannot resolve symbol for Foreach
import java.util._
import org.apache.kafka.clients.consumer._
import org.apache.kafka.common.serialization.Deserializer
object ConsumerExample {
def main(args: Array[String]): Unit = {
val T_Name = "CarSensor"
val T_Group_Name = "CarSensorGroup"
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092,localhost:9093,localhost:9094")
props.put("group.id",T_Group_Name)
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
val Kafka_Consumer=new KafkaConsumer[String,String](props)
Kafka_Consumer.subscribe (Arrays.asList(T_Name))
while(true)
{
val Consumer_Record=Kafka_Consumer.poll(100) //ConsumerRecords Object
// val RecordList=Consumer_Record.toString
for( i <- Consumer_Record)
{ // This is where the "Cannot resolve symbol foreach" issue shows up, on the <- symbol.
println("Supplier id = "+String.valueOf(i.value().getID())+ "Supplier name = " +i.value().getID())
}
}
}
}
I have used the <- symbol in many examples before and it worked.
I thought it was an issue with IntelliJ and restarted it. I guess it's a problem with the object getting cast to a different type.
Consumer_Record.forEach(i => {
println("Supplier id = "+String.valueOf(i.value().getID())+ "Supplier name = " +i.value().getID())
})
works fine for me.
Except that String doesn't have a getID() method.
You can use for(i <- Consumer_Record.asScala) if you want for syntax, but you have to add import scala.collection.JavaConverters._.
import scala.collection.JavaConverters._

val Kafka_Consumer = new KafkaConsumer[String, String](props)
Kafka_Consumer.subscribe(Arrays.asList(T_Name))
while (true) {
  val Consumer_Record = Kafka_Consumer.poll(100) // ConsumerRecords object
  for (i <- Consumer_Record.asScala) {
    println("Supplier id = " + String.valueOf(i.value()) + " Supplier name = " + i.key())
  }
}

recursive value x$5 needs type

I am getting an error at this line:
val Array(outputDirectory, Utils.IntParam(numTweetsToCollect), Utils.IntParam(intervalSecs), Utils.IntParam(partitionsEachInterval)) =
Utils.parseCommandLineWithTwitterCredentials(args)
recursive value x$7 needs type
recursive value x$1 needs type
What does this error mean? Please guide me on how to resolve it.
object Collect {
private var numTweetsCollected = 0L
private var partNum = 0
private var gson = new Gson()
def main(args: Array[String]) {
// Process program arguments and set properties
if (args.length < 3) {
System.err.println("Usage: " + this.getClass.getSimpleName +
"<outputDirectory> <numTweetsToCollect> <intervalInSeconds> <partitionsEachInterval>")
System.exit(1)
}
val Array(outputDirectory, Utils.IntParam(numTweetsToCollect), Utils.IntParam(intervalSecs), Utils.IntParam(partitionsEachInterval)) =
Utils.parseCommandLineWithTwitterCredentials(args)
val outputDir = new File(outputDirectory.toString)
if (outputDir.exists()) {
System.err.println("ERROR - %s already exists: delete or specify another directory".format(
outputDirectory))
System.exit(1)
}
outputDir.mkdirs()
println("Initializing Streaming Spark Context...")
val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(intervalSecs))
val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth)
.map(gson.toJson(_))
tweetStream.foreachRDD((rdd, time) => {
val count = rdd.count()
if (count > 0) {
val outputRDD = rdd.repartition(partitionsEachInterval)
outputRDD.saveAsTextFile(outputDirectory + "/tweets_" + time.milliseconds.toString)
numTweetsCollected += count
if (numTweetsCollected > numTweetsToCollect) {
System.exit(0)
}
}
})
ssc.start()
ssc.awaitTermination()
}
}
Try removing the Utils.IntParam(...) extractors from your pattern-matched values. Extract the values, then parse them separately, as sketched below.
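A minimal sketch of that suggestion, assuming parseCommandLineWithTwitterCredentials returns the remaining arguments as an array of strings (as the original pattern match implies):

val Array(outputDirectory, numTweetsToCollectArg, intervalSecsArg, partitionsEachIntervalArg) =
  Utils.parseCommandLineWithTwitterCredentials(args)

// Parse the numeric arguments explicitly instead of using the IntParam extractor.
val numTweetsToCollect = numTweetsToCollectArg.toString.toInt
val intervalSecs = intervalSecsArg.toString.toInt
val partitionsEachInterval = partitionsEachIntervalArg.toString.toInt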

Sorting a DStream and taking topN

I have a DStream in Spark Scala and I want to sort it and then take the top N.
The problem is that whenever I try to run it I get a NotSerializableException, and the exception message says:
This is because the DStream object is being referred to from within the closure.
The problem is that I don't know how to solve it.
Here is my try:
package com.badrit.realtime
import java.util.Date
import com.badrit.drivers.UnlimitedSpaceTimeDriver
import com.badrit.model.{CellBuilder, DataReader, Trip}
import com.badrit.utility.Printer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.{Duration, Milliseconds, StreamingContext}
import scala.collection.mutable
object StreamingDriver {
val appName: String = "HotSpotRealTime"
val hostName = "localhost"
val port = 5050
val constrains = UnlimitedSpaceTimeDriver.constrains;
var streamingRate = 1;
var windowSize = 8;
var slidingInterval = 2;
val cellBuilder = new CellBuilder(constrains)
val inputFilePath = "/home/ahmedelgamal/Downloads/green_tripdata_2015-02.csv"
def prepareTestData(sparkStreamCtx: StreamingContext): InputDStream[Trip] = {
val sparkCtx = sparkStreamCtx.sparkContext
val textFile: RDD[String] = sparkCtx.textFile(inputFilePath)
val data: RDD[Trip] = new DataReader().getTrips(textFile)
val groupedData = data.filter(_.pickup.date.before(new Date(2015, 1, 2, 0, 0, 0)))
.groupBy(trip => trip.pickup.date.getMinutes).sortBy(_._1).map(_._2).collect()
printf("Grouped Data Count is " + groupedData.length)
var dataQueue: mutable.Queue[RDD[Trip]] = mutable.Queue.empty;
groupedData.foreach(trips => dataQueue += sparkCtx.makeRDD(trips.toArray))
printf("\n\nTest Queue size is " + dataQueue.size)
groupedData.zipWithIndex.foreach { case (trips: Iterable[Trip], index: Int) => {
println("Items List " + index)
val passengers: Array[Int] = trips.map(_.passengers).toArray
val cnt = passengers.length
println("Sum is " + passengers.sum)
println("Cnt is " + cnt)
val passengersRdd = sparkCtx.parallelize(passengers)
println("Mean " + passengersRdd.mean())
println("Stdv" + passengersRdd.stdev())
}
}
sparkStreamCtx.queueStream(dataQueue, true)
}
def cellCreator(trip: Trip) = cellBuilder.cellForCarStop(trip.pickup)
def main(args: Array[String]) {
if (args.length < 1) {
streamingRate = 1;
windowSize = 3 //2 hours 60 * 60 * 1000L
slidingInterval = 2 //0.5 hour 60 * 60 * 1000L
}
else {
streamingRate = args(0).toInt;
windowSize = args(1).toInt
slidingInterval = args(2).toInt
}
val sparkConf = new SparkConf().setAppName(appName).setMaster("local[*]")
val sparkStreamCtx = new StreamingContext(sparkConf, Milliseconds(streamingRate))
sparkStreamCtx.sparkContext.setLogLevel("ERROR")
sparkStreamCtx.checkpoint("/tmp")
val data: InputDStream[Trip] = prepareTestData(sparkStreamCtx)
val dataWindow = data.window(new Duration(windowSize), new Duration(slidingInterval))
//my main problem lies in the following line
val newDataWindow = dataWindow.transform(rdd => sparkStreamCtx.sparkContext.parallelize(rdd.take(10)))
newDataWindow.print
sparkStreamCtx.start()
sparkStreamCtx.awaitTerminationOrTimeout(1000)
}
}
I don't mind other ways of sorting a DStream and getting its top N instead of my approach.
You can use the transform method on the DStream, sort the input RDD and take its top n elements into a list, then filter the original RDD to keep only the elements contained in that list.
val n = 10
val topN = result.transform(rdd => {
  val list = rdd.sortBy(_._1).take(n)
  rdd.filter(list.contains)
})
topN.print()
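Applied to the code in the question, the same pattern can be written so that nothing from the StreamingContext is captured inside the transform closure, which is what triggers the NotSerializableException there. This is only a sketch, and the sort key (_.passengers) is an arbitrary assumption:

val n = 10
val newDataWindow = dataWindow.transform { rdd =>
  // Sort and take the top n on the incoming RDD, then rebuild a small RDD
  // using the RDD's own SparkContext instead of referencing sparkStreamCtx.
  val top = rdd.sortBy(_.passengers, ascending = false).take(n)
  rdd.sparkContext.parallelize(top)
}
newDataWindow.print()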

Create a Simple Kafka Consumer using Scala

I am currently learning Scala and was trying to create a SimpleConsumer for retrieving messages from a Kafka partition.
The consumer should be able to handle the following tasks:
Keep track of Offsets.
Figure out which Broker is the lead Broker for a topic and partition
Must be able to handle Broker leader changes.
I was able to find very good documentation for creating this consumer in Java (https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example).
Does anyone have sample Scala code for creating this SimpleConsumer, or could you refer me to some documentation that points me in the right direction? It would be greatly appreciated.
Here is the sample code of a simple Kafka consumer written in Scala. I got it working after a few trials and errors.
package com.Kafka.Consumer
import kafka.api.FetchRequest
import kafka.api.FetchRequestBuilder
import kafka.api.PartitionOffsetRequestInfo
import kafka.common.ErrorMapping
import kafka.common.TopicAndPartition
import kafka.javaapi._
import kafka.javaapi.consumer.SimpleConsumer
import kafka.message.MessageAndOffset
import java.nio.ByteBuffer
import java.util.ArrayList
import java.util.Collections
import java.util.HashMap
import java.util.List
import java.util.Map
import SimpleExample._
//remove if not needed
import scala.collection.JavaConversions._
object SimpleExample {
def main(args: Array[String]) {
val example = new SimpleExample()
val maxReads = java.lang.Integer.parseInt(args(0))
val topic = args(1)
val partition = java.lang.Integer.parseInt(args(2))
val seeds = new ArrayList[String]()
seeds.add(args(3))
val port = java.lang.Integer.parseInt(args(4))
try {
example.run(maxReads, topic, partition, seeds, port)
} catch {
case e: Exception => {
println("Oops:" + e)
e.printStackTrace()
}
}
}
def getLastOffset(consumer: SimpleConsumer,
topic: String,
partition: Int,
whichTime: Long,
clientName: String): Long = {
val topicAndPartition = new TopicAndPartition(topic, partition)
val requestInfo = new HashMap[TopicAndPartition, PartitionOffsetRequestInfo]()
requestInfo.put(topicAndPartition, new PartitionOffsetRequestInfo(whichTime, 1))
val request = new kafka.javaapi.OffsetRequest(requestInfo, kafka.api.OffsetRequest.CurrentVersion, clientName)
val response = consumer.getOffsetsBefore(request)
if (response.hasError) {
println("Error fetching data Offset Data the Broker. Reason: " +
response.errorCode(topic, partition))
return 0
}
val offsets = response.offsets(topic, partition)
offsets(0)
}
}
class SimpleExample {
private var m_replicaBrokers: List[String] = new ArrayList[String]()
def run(a_maxReads: Int,
a_topic: String,
a_partition: Int,
a_seedBrokers: List[String],
a_port: Int) {
val metadata = findLeader(a_seedBrokers, a_port, a_topic, a_partition)
if (metadata == null) {
println("Can't find metadata for Topic and Partition. Exiting")
return
}
if (metadata.leader == null) {
println("Can't find Leader for Topic and Partition. Exiting")
return
}
var leadBroker = metadata.leader.host
val clientName = "Client_" + a_topic + "_" + a_partition
var consumer = new SimpleConsumer(leadBroker, a_port, 100000, 64 * 1024, clientName)
var readOffset = getLastOffset(consumer, a_topic, a_partition, kafka.api.OffsetRequest.EarliestTime, clientName)
var numErrors = 0
//while (a_maxReads > 0) {
if (consumer == null) {
consumer = new SimpleConsumer(leadBroker, a_port, 100000, 64 * 1024, clientName)
}
val req = new FetchRequestBuilder().clientId(clientName).addFetch(a_topic, a_partition, readOffset,
100000)
.build()
val fetchResponse = consumer.fetch(req)
if (fetchResponse.hasError) {
numErrors += 1
val code = fetchResponse.errorCode(a_topic, a_partition)
println("Error fetching data from the Broker:" + leadBroker +
" Reason: " +
code)
if (numErrors > 5) //break
if (code == ErrorMapping.OffsetOutOfRangeCode) {
readOffset = getLastOffset(consumer, a_topic, a_partition, kafka.api.OffsetRequest.LatestTime, clientName)
//continue
}
consumer.close()
consumer = null
leadBroker = findNewLeader(leadBroker, a_topic, a_partition, a_port)
//continue
}
numErrors = 0
var numRead = 0
for (messageAndOffset <- fetchResponse.messageSet(a_topic, a_partition)) {
val currentOffset = messageAndOffset.offset
if (currentOffset < readOffset) {
println("Found an old offset: " + currentOffset + " Expecting: " +
readOffset)
//continue
}
readOffset = messageAndOffset.nextOffset
val payload = messageAndOffset.message.payload
val bytes = Array.ofDim[Byte](payload.limit())
payload.get(bytes)
println(String.valueOf(messageAndOffset.offset) + ": " + new String(bytes, "UTF-8"))
numRead += 1
// a_maxReads -= 1
}
if (numRead == 0) {
try {
Thread.sleep(1000)
} catch {
case ie: InterruptedException =>
}
}
//}
if (consumer != null) consumer.close()
}
private def findNewLeader(a_oldLeader: String,
a_topic: String,
a_partition: Int,
a_port: Int): String = {
for (i <- 0 until 3) {
var goToSleep = false
val metadata = findLeader(m_replicaBrokers, a_port, a_topic, a_partition)
if (metadata == null) {
goToSleep = true
} else if (metadata.leader == null) {
goToSleep = true
} else if (a_oldLeader.equalsIgnoreCase(metadata.leader.host) && i == 0) {
goToSleep = true
} else {
return metadata.leader.host
}
if (goToSleep) {
try {
Thread.sleep(1000)
} catch {
case ie: InterruptedException =>
}
}
}
println("Unable to find new leader after Broker failure. Exiting")
throw new Exception("Unable to find new leader after Broker failure. Exiting")
}
private def findLeader(a_seedBrokers: List[String],
a_port: Int,
a_topic: String,
a_partition: Int): PartitionMetadata = {
var returnMetaData: PartitionMetadata = null
for (seed <- a_seedBrokers) {
var consumer: SimpleConsumer = null
try {
consumer = new SimpleConsumer(seed, a_port, 100000, 64 * 1024, "leaderLookup")
val topics = Collections.singletonList(a_topic)
val req = new TopicMetadataRequest(topics)
val resp = consumer.send(req)
val metaData = resp.topicsMetadata
for (item <- metaData; part <- item.partitionsMetadata){
if (part.partitionId == a_partition) {
returnMetaData = part
//break
}
}
} catch {
case e: Exception => println("Error communicating with Broker [" + seed + "] to find Leader for [" +
a_topic +
", " +
a_partition +
"] Reason: " +
e)
} finally {
if (consumer != null) consumer.close()
}
}
if (returnMetaData != null) {
m_replicaBrokers.clear()
for (replica <- returnMetaData.replicas) {
m_replicaBrokers.add(replica.host)
}
}
returnMetaData
}
}
I built a simple Kafka consumer and producer using Scala.
consumer:
package com.kafka
import java.util.concurrent._
import java.util.{Collections, Properties}
import com.sun.javafx.util.Logging
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import scala.collection.JavaConversions._
class Consumer(val brokers: String,
val groupId: String,
val topic: String) extends Logging {
val props = createConsumerConfig(brokers, groupId)
val consumer = new KafkaConsumer[String, String](props)
var executor: ExecutorService = null
def shutdown() = {
if (consumer != null)
consumer.close()
if (executor != null)
executor.shutdown()
}
def createConsumerConfig(brokers: String, groupId: String): Properties = {
val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true")
props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000")
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000")
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
props
}
def run() = {
consumer.subscribe(Collections.singletonList(this.topic))
Executors.newSingleThreadExecutor.execute(new Runnable {
override def run(): Unit = {
while (true) {
val records = consumer.poll(1000)
for (record <- records) {
System.out.println("Received message: (" + record.key() + ", " + record.value() + ") at offset " + record.offset())
}
}
}
})
}
}
object Consumer extends App{
val newArgs = Array("localhost:9092", "2","test")
val example = new Consumer(newArgs(0), newArgs(1), newArgs(2))
example.run()
}
producer:
package com.kafka
import java.util.{Date, Properties}
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord
object Producer extends App{
val newArgs = Array("20","test","localhost:9092")
val events = newArgs(0).toInt
val topic = newArgs(1)
val brokers = newArgs(2)
val props = new Properties()
props.put("bootstrap.servers", brokers)
props.put("client.id", "producer")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
val t = System.currentTimeMillis()
for (nEvents <- Range(0, events)) {
val key = "messageKey " + nEvents.toString
val msg = "test message"
val data = new ProducerRecord[String, String](topic, key, msg)
//async
//producer.send(data, (m,e) => {})
//sync
producer.send(data)
}
System.out.println("sent per second: " + events * 1000 / (System.currentTimeMillis() - t))
producer.close()
}

Spark job not parallelising locally (using Parquet + Avro from local filesystem)

Edit 2: I indirectly solved the problem by repartitioning the RDD into 8 partitions. I hit a roadblock with Avro objects not being "java serialisable" and found a snippet here to delegate Avro serialisation to Kryo. The original problem still remains.
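For reference, a minimal sketch of the kind of Kryo delegation that edit refers to; the registration below is an assumption (the snippet mentioned above may use a different mechanism, e.g. a custom registrator), and registerKryoClasses requires a Spark version that provides it:

import org.apache.spark.{SparkConf, SparkContext}

// Switch Spark to Kryo and register the Avro-generated classes so they no longer
// have to be java.io.Serializable when they are shuffled or cached.
val conf = new SparkConf()
  .setMaster("local[8]")
  .setAppName("forumAddNlp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Topic], classOf[Post]))
val sc = new SparkContext(conf)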
Edit 1: Removed the local variable reference in the map function.
I'm writing a driver to run a compute-heavy job on Spark, using Parquet and Avro for IO/schema. I can't seem to get Spark to use all my cores. What am I doing wrong? Is it because I have set the keys to null?
I am just getting my head around how Hadoop organises files. AFAIK, since my file has a gigabyte of raw data, I should expect to see things parallelising with the default block and page sizes.
The function to ETL my input for processing looks as follows:
def genForum {
class MyWriter extends AvroParquetWriter[Topic](new Path("posts.parq"), Topic.getClassSchema) {
override def write(t: Topic) {
synchronized {
super.write(t)
}
}
}
def makeTopic(x: ForumTopic): Topic = {
// Ommited to save space
}
val writer = new MyWriter
val q =
DBCrawler.db.withSession {
Query(ForumTopics).filter(x => x.crawlState === TopicCrawlState.Done).list()
}
val sz = q.size
val c = new AtomicInteger(0)
q.par.foreach {
x =>
writer.write(makeTopic(x))
val count = c.incrementAndGet()
print(f"\r${count.toFloat * 100 / sz}%4.2f%%")
}
writer.close()
}
And my transformation looks as follows :
def sparkNLPTransformation() {
val sc = new SparkContext("local[8]", "forumAddNlp")
// io configuration
val job = new Job()
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Topic]])
ParquetOutputFormat.setWriteSupportClass(job,classOf[AvroWriteSupport])
AvroParquetOutputFormat.setSchema(job, Topic.getClassSchema)
// configure annotator
val props = new Properties()
props.put("annotators", "tokenize,ssplit,pos,lemma,parse")
val an = DAnnotator(props)
// annotator function
def annotatePosts(ann : DAnnotator, top : Topic) : Topic = {
val new_p = top.getPosts.map{ x=>
val at = new Annotation(x.getPostText.toString)
ann.annotator.annotate(at)
val t = at.get(classOf[SentencesAnnotation]).map(_.get(classOf[TreeAnnotation])).toList
val r = SpecificData.get().deepCopy[Post](x.getSchema,x)
if(t.nonEmpty) r.setTrees(t)
r
}
val new_t = SpecificData.get().deepCopy[Topic](top.getSchema,top)
new_t.setPosts(new_p)
new_t
}
// transformation
val ds = sc.newAPIHadoopFile("forum_dataset.parq", classOf[ParquetInputFormat[Topic]], classOf[Void], classOf[Topic], job.getConfiguration)
val new_ds = ds.map(x=> ( null, annotatePosts(x._2) ) )
new_ds.saveAsNewAPIHadoopFile("annotated_posts.parq",
classOf[Void],
classOf[Topic],
classOf[ParquetOutputFormat[Topic]],
job.getConfiguration
)
}
Can you confirm that the data is indeed in multiple blocks in HDFS? What is the total block count on the forum_dataset.parq file?
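A quick check from the Spark side (just a sketch, using the ds RDD from sparkNLPTransformation above) is to look at how many partitions the input actually produced, and to force more if there is only one:

// Number of input splits Spark created for the Parquet file; with a single split
// there is nothing to spread across cores.
println(s"input partitions: ${ds.partitions.length}")

// Force more parallelism before the expensive NLP map (this is what "edit 2" did).
val repartitioned = ds.repartition(8)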