How to receive zipped JSON data from RabbitMQ in Scala

I'm using the Scala code below to receive messages from RabbitMQ:
import java.nio.charset.StandardCharsets
import scala.collection.JavaConverters._
import com.rabbitmq.client.Channel
import com.rabbitmq.client.Connection
import com.rabbitmq.client.ConnectionFactory
import com.rabbitmq.client.ConsumerCancelledException
import com.rabbitmq.client.QueueingConsumer
val rabbitMQconnection = getRabbitMQConnection
val channel = rabbitMQconnection.createChannel()
val args = Map[String, AnyRef]("x-message-ttl" -> Long.box(40000))
// queueDeclare expects a java.util.Map, so convert the Scala Map
channel.queueDeclare("test", true, false, false, args.asJava)
val consumer = new QueueingConsumer(channel)
channel.basicConsume("test", true, consumer)
var message: String = null
val delivery = consumer.nextDelivery()
message = new String(delivery.getBody(), StandardCharsets.UTF_8)
println("at consumer : " + message)
My input data from RabbitMQ is zipped (compressed) JSON. With the above code I'm unable to unzip the data. Could someone please let me know how to read zipped JSON data?
Thank you.
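A minimal sketch of one way to read it, assuming the producer GZIP-compressed the JSON payload (if it used raw DEFLATE or a zip archive instead, swap in InflaterInputStream or ZipInputStream): decompress delivery.getBody() before converting it to a String.
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream
import scala.io.Source

// Decompress a GZIP-compressed message body into a UTF-8 string.
def gunzip(body: Array[Byte]): String = {
  val in = new GZIPInputStream(new ByteArrayInputStream(body))
  try Source.fromInputStream(in, StandardCharsets.UTF_8.name()).mkString
  finally in.close()
}

// Usage with the consumer above (delivery comes from consumer.nextDelivery()):
// val message = gunzip(delivery.getBody())
// println("at consumer : " + message)
The resulting string can then be parsed with whichever JSON library you already use.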

Related

How to deserialize an Avro response from a DataStream in Scala + Apache Flink

I am getting an Avro response from a Kafka topic in Confluent and I am facing issues when I want to deserialize the response. I don't understand the syntax for defining the Avro deserializer and using it in my Kafka source while reading.
Sharing the approach I am currently using.
I have a topic in Confluent named employee which produces a message every 10 seconds, and each message is serialized via the Avro schema registry in Confluent.
I am trying to read those messages in my Scala program. I was able to print the serialized messages, but I am not able to deserialize them.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.formats.avro.AvroDeserializationSchema
import org.apache.avro.generic.GenericData
import org.apache.avro.generic.GenericRecord
import java.time.Duration

case class emp(
  name: String,
  age: Int
)

object Main {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val schemaRegistryUrl = "http://localhost:8081"
    val source = KafkaSource.builder[String]
      .setBootstrapServers("localhost:9092")
      .setTopics("employee")
      .setGroupId("my-group")
      .setStartingOffsets(OffsetsInitializer.earliest)
      .setValueOnlyDeserializer(new SimpleStringSchema())
      .build
    val streamEnv: DataStream[String] =
      env.fromSource(source, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(20)), "Kafka Source")
    streamEnv.print()
    env.execute("Example")
  }
}
I also tried defining the Avro deserializer in the Kafka source while reading:
.setValueOnlyDeserializer(new AvroDeserializationSchema[emp](classOf[emp]))
but had no luck with that approach either.
Rather than an AvroDeserializationSchema, you need to use a ConfluentRegistryAvroDeserializationSchema. The standard Avro deserializer doesn't understand what to do with the magic byte that the Confluent serializer includes.
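A minimal sketch of what that could look like, assuming the flink-avro-confluent-registry dependency is on the classpath and that the hand-written reader schema below matches what is registered for the employee topic (both are assumptions, not taken from the question):
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroDeserializationSchema
import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo
import org.apache.flink.streaming.api.scala._

object ConfluentAvroExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val schemaRegistryUrl = "http://localhost:8081"

    // Reader schema; it must be compatible with the schema registered for "employee".
    val empSchema = new Schema.Parser().parse(
      """{"type":"record","name":"emp","fields":[
        |  {"name":"name","type":"string"},
        |  {"name":"age","type":"int"}
        |]}""".stripMargin)

    // GenericRecord is not handled well by Flink's default (Kryo) serialization,
    // so pass an Avro-aware TypeInformation explicitly.
    val typeInfo: TypeInformation[GenericRecord] = new GenericRecordAvroTypeInfo(empSchema)

    val source = KafkaSource.builder[GenericRecord]()
      .setBootstrapServers("localhost:9092")
      .setTopics("employee")
      .setGroupId("my-group")
      .setStartingOffsets(OffsetsInitializer.earliest)
      // Registry-aware deserializer: it strips the magic byte and the 4-byte
      // schema id, then fetches the writer schema from the registry.
      .setValueOnlyDeserializer(
        ConfluentRegistryAvroDeserializationSchema.forGeneric(empSchema, schemaRegistryUrl))
      .build()

    val stream: DataStream[GenericRecord] =
      env.fromSource(source, WatermarkStrategy.noWatermarks[GenericRecord](), "Kafka Source")(typeInfo)

    stream
      .map(record => (record.get("name").toString, record.get("age").toString.toInt))
      .print()

    env.execute("Confluent Avro Example")
  }
}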

Deserializing bytes to GenericRecord / Row

With the aim of using DataSourceV2 to read some stored Parquet binary files for which I already have the Spark schema, I am struggling to find a way to deserialize a Parquet stream into GenericRecords / Rows.
For Avro I found that we can do that using something like:
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.spark.sql.avro.AvroDeserializer
...
val datumReader = new GenericDatumReader[GenericRecord]()
val reader = DataFileReader.openReader(avroStream, datumReader)
val iterator = reader.iterator()
val record = iterator.next()
val row = AvroDeserializer(reader.getSchema, schema).deserialize(record)
where avroStream is a stream of bytes.
Are there some utility classes that can help?
Thanks for your help!
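One hedged sketch, not a definitive answer: if the Parquet data can be exposed through a Hadoop path (rather than a raw in-memory stream), parquet-avro's AvroParquetReader can hand back GenericRecords, which can then be converted to rows much like the Avro snippet above. The path and configuration below are placeholders.
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Placeholder location of one of the stored Parquet files.
val conf = new Configuration()
val inputFile = HadoopInputFile.fromPath(new Path("/tmp/some-file.parquet"), conf)

// Iterate over the file as Avro GenericRecords.
val reader = AvroParquetReader.builder[GenericRecord](inputFile).build()
try {
  Iterator
    .continually(reader.read())   // read() returns null at end of file
    .takeWhile(_ != null)
    .foreach(record => println(record))
} finally {
  reader.close()
}
Converting each GenericRecord to a Spark row could then reuse the AvroDeserializer call shown in the question, keeping in mind that it is a Spark-internal class whose signature varies between Spark versions.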

Simple Spark Scala Post to External Rest API Example

I'm new to Spark/Scala. I just want to read a JSON file and post the content to an external REST API server. Can anyone provide a simple example or some guidelines?
You probably do not want to use Spark for this. Spark is an analytical engine for processing large amounts of data - unless you're reading in massive amounts of JSON from HDFS, this task is better handled in plain Scala. You should look up ways to read a JSON file in Scala and to send that content to a server in Scala.
Here are some great places to get started:
Scala Read JSON file
https://alvinalexander.com/scala/how-to-send-json-post-data-to-restful-url-in-scala
The following is from the above URL:
import java.io._
import org.apache.commons._
import org.apache.http._
import org.apache.http.client._
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.DefaultHttpClient
import java.util.ArrayList
import org.apache.http.message.BasicNameValuePair
import org.apache.http.client.entity.UrlEncodedFormEntity
import com.google.gson.Gson

case class Person(firstName: String, lastName: String, age: Int)

object HttpJsonPostTest extends App {
  // create our object as a json string
  val spock = new Person("Leonard", "Nimoy", 82)
  val spockAsJson = new Gson().toJson(spock)

  // add name value pairs to a post object
  val post = new HttpPost("http://localhost:8080/posttest")
  val nameValuePairs = new ArrayList[NameValuePair]()
  nameValuePairs.add(new BasicNameValuePair("JSON", spockAsJson))
  post.setEntity(new UrlEncodedFormEntity(nameValuePairs))

  // send the post request
  val client = new DefaultHttpClient
  val response = client.execute(post)
  println("--- HEADERS ---")
  response.getAllHeaders.foreach(arg => println(arg))
}
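Since the question asks about posting the contents of a JSON file rather than form-encoded pairs, here is a hedged variant of the same Apache HttpClient approach that reads a file and sends it as an application/json body; the file path and URL are placeholders.
import org.apache.http.client.methods.HttpPost
import org.apache.http.entity.{ContentType, StringEntity}
import org.apache.http.impl.client.HttpClients
import scala.io.Source

object PostJsonFile extends App {
  // Read the whole JSON file into a string (placeholder path).
  val source = Source.fromFile("/tmp/input.json")
  val json = try source.mkString finally source.close()

  // Post it as an application/json request body (placeholder URL).
  val post = new HttpPost("http://localhost:8080/posttest")
  post.setEntity(new StringEntity(json, ContentType.APPLICATION_JSON))

  val client = HttpClients.createDefault()
  try {
    val response = client.execute(post)
    println(response.getStatusLine)
  } finally {
    client.close()
  }
}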

Deserialize set of Confluent-encoded Avro in one file

I have a file which contains binary Avro records appended one after another. I would like to read each record one by one. At the same time, I would like to read the first few bytes of each record, which hold the schema id, and then deserialize it. With the code below I am able to skip those bytes and use a fixed schema, and that works for me. But I would like to read each record individually. Is that possible?
import java.nio.file.{Files, Paths}
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

val client = new SchemaRegistryClient("SCHEMA_REGISTRY_URL")
val schema = new Schema.Parser().parse(client.getSchema("TOPIC_NAME").get.toString)
val reader = new GenericDatumReader[GenericRecord](schema)
val filename = "MY_BINARY_AVRO.avro"
val fileContInBytes = Files.readAllBytes(Paths.get(filename))
val decoder = DecoderFactory.get.binaryDecoder(fileContInBytes, null)
while (!decoder.isEnd) {
  decoder.skipFixed(5) // magic byte + 4-byte schema id
  val rec = reader.read(null, decoder)
}
Here is Python code which is able to deserialize the binary Avro records laid out next to each other, seamlessly moving the byte position as it goes:
from avro import schema, datafile, io
import io
import avro
import requests
import os

topic = r'TOPIC_NAME'
schemaurl = r'SCHEMA_REGISTRY_URL'
OUTFILE_NAME = r'INPUT_BINARY_AVRO_FILE_LOCATION'

f = open(OUTFILE_NAME, 'rb')
buf = io.BytesIO(f.read())
decoder = avro.io.BinaryDecoder(buf)
while buf.tell() < os.path.getsize(OUTFILE_NAME):
    id = int.from_bytes((buf.read(4)), byteorder='big')
    SCHEMA = avro.schema.Parse(getSchema(schemaurl, id))
    rec_reader = avro.io.DatumReader(SCHEMA)
    out = rec_reader.read(decoder)
    print(out)
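A hedged Scala sketch of the same idea: read the 1-byte magic and 4-byte schema id for each record yourself, look the schema up, and decode with a non-buffering decoder so the stream position stays in sync. lookupSchemaById stands in for whatever your registry client provides; it is a hypothetical helper, not an API from the question.
import java.io.{ByteArrayInputStream, DataInputStream}
import java.nio.file.{Files, Paths}
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Hypothetical helper: fetch the writer schema for a given schema id.
def lookupSchemaById(id: Int): Schema = ???

val bytes = Files.readAllBytes(Paths.get("MY_BINARY_AVRO.avro"))
val in = new DataInputStream(new ByteArrayInputStream(bytes))

// directBinaryDecoder does not buffer ahead, so reading the 5-byte header
// straight from the stream and then decoding a record keeps positions aligned.
val decoder = DecoderFactory.get.directBinaryDecoder(in, null)

while (in.available() > 0) {
  val magic = in.readByte()    // Confluent wire format: 0x00
  val schemaId = in.readInt()  // 4-byte big-endian schema id
  val writerSchema = lookupSchemaById(schemaId)
  val reader = new GenericDatumReader[GenericRecord](writerSchema)
  val record = reader.read(null, decoder)
  println(s"schema id $schemaId: $record")
}
In practice you would cache the parsed schemas and readers by id rather than re-creating them for every record.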

Using iterated writing in HDFS file by using Spark/Scala

I am learning how to read and write files in HDFS using Spark/Scala.
I am unable to write to the HDFS file: the file is created, but it's empty.
I don't know how to create a loop for writing to a file.
The code is:
import scala.collection.immutable.Map
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
// Read the adult CSV file
val logFile = "hdfs://zobbi01:9000/input/adult.csv"
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
//val logFile = sc.textFile("hdfs://zobbi01:9000/input/adult.csv")
val headerAndRows = logData.map(line => line.split(",").map(_.trim))
val header = headerAndRows.first
val data = headerAndRows.filter(_(0) != header(0))
val maps = data.map(splits => header.zip(splits).toMap)
val result = maps.filter(map => map("AGE") != "23")
result.foreach {
  result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
If I replace:
result.foreach{println}
Then it works!
but when using the saveAsTextFile method, an error message is thrown:
<console>:76: error: type mismatch;
found : Unit
required: scala.collection.immutable.Map[String,String] => Unit
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
Any help please.
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
That is all you need to do. You don't need to loop through all the rows.
Hope this helps!
As for what this does:
result.foreach {
  result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
An RDD action cannot be triggered from inside RDD transformations unless a special conf is set.
Just use result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt") to save to HDFS.
If you need other formats to be written to the file, change the RDD itself before writing.
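A small sketch of how the end of the snippet above could look, including one hedged example of reshaping the RDD into CSV lines before saving (the output paths are placeholders, not from the question):
val result = maps.filter(map => map("AGE") != "23")

// saveAsTextFile is itself an action, so no foreach loop is needed.
result.saveAsTextFile("hdfs://zobbi01:9000/output/test2")

// To control the written format, transform the RDD first, e.g. back into CSV lines
// using the header from the earlier code.
result
  .map(row => header.map(col => row.getOrElse(col, "")).mkString(","))
  .saveAsTextFile("hdfs://zobbi01:9000/output/test2_csv")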