Inserting a document into Elasticsearch using Alpakka - Scala

I'm trying to learn how to use Alpakka and have set up a test to write a document to Elasticsearch. From reading the docs, including https://doc.akka.io/docs/alpakka/current/elasticsearch.html, I have written the following:
import akka.actor.ActorSystem
import akka.stream.alpakka.elasticsearch.scaladsl.ElasticsearchSink
import akka.stream.alpakka.elasticsearch._
import akka.stream.scaladsl.Source
import spray.json.DefaultJsonProtocol._
import spray.json.{JsonFormat, _}

object AlpakkaWrite extends App {

  case class VolResult(symbol: String, vol: Double, timestamp: Long)

  implicit val actorSystem = ActorSystem()

  val connectionString = "****"
  val userName = "****"
  val password = "****"

  def constructElasticsearchParams(indexName: String, typeName: String, apiVersion: ApiVersion) =
    if (apiVersion eq ApiVersion.V5)
      ElasticsearchParams.V5(indexName, typeName)
    else if (apiVersion eq ApiVersion.V7)
      ElasticsearchParams.V7(indexName)
    else
      throw new IllegalArgumentException("API version " + apiVersion + " is not supported")

  val connectionSettings = ElasticsearchConnectionSettings
    .create(connectionString)
    .withCredentials(userName, password)

  val sinkSettings =
    ElasticsearchWriteSettings.create(connectionSettings).withApiVersion(ApiVersion.V7)

  implicit val formatVersionTestDoc: JsonFormat[VolResult] = jsonFormat3(VolResult)

  Source(List(VolResult("test", 1, System.currentTimeMillis())))
    .map { message: VolResult =>
      WriteMessage.createIndexMessage("00002", message)
    }
    .log("Error")
    .runWith(
      ElasticsearchSink.create[VolResult](
        constructElasticsearchParams("ccy_vol_normalized", "_doc", ApiVersion.V7),
        settings = sinkSettings
      )
    )
}
Output:
19:15:51.815 [default-akka.actor.default-dispatcher-5] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
19:15:52.547 [default-akka.actor.default-dispatcher-5] ERROR akka.stream.alpakka.elasticsearch.impl.ElasticsearchSimpleFlowStage$StageLogic - Received error from elastic after having already processed 0 documents. Error: java.lang.RuntimeException: Request failed for POST /_bulk
Have I defined the case class VolResult correctly? Does it match the expected payload defined in the index mapping below?
"properties": {
"timestamp": { "type": "date",
"format": "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
},
"vol": { "type": "float" },
"symbol": { "type": "text" }
}
Using the Elasticsearch Dev Tools console, the following command inserts a document successfully:
POST ccy_vol_normalized/_doc/
{
  "timestamp": "2022-10-21T00:00:00.000Z",
  "vol": 1.221,
  "symbol": "SYM"
}

This works :
import akka.actor.ActorSystem
import akka.stream.alpakka.elasticsearch._
import akka.stream.alpakka.elasticsearch.scaladsl.ElasticsearchSink
import akka.stream.scaladsl.Source
import spray.json.DefaultJsonProtocol._
import spray.json.JsonFormat

import java.text.SimpleDateFormat
import java.util.Date

object AlpakkaWrite extends App {

  val connectionString = ""
  implicit val actorSystem = ActorSystem()
  val userName = ""
  val password = ""

  val connectionSettings = ElasticsearchConnectionSettings
    .create(connectionString)
    .withCredentials(userName, password)

  val sinkSettings =
    ElasticsearchWriteSettings.create(connectionSettings).withApiVersion(ApiVersion.V7)

  val HOUR = 1000 * 60 * 60
  val utcDate = new Date(System.currentTimeMillis() - HOUR)
  val ts = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS").format(utcDate) + "Z"

  implicit val formatVersionTestDoc: JsonFormat[VolResult] = jsonFormat3(VolResult)

  def constructElasticsearchParams(indexName: String, typeName: String, apiVersion: ApiVersion) =
    if (apiVersion eq ApiVersion.V5)
      ElasticsearchParams.V5(indexName, typeName)
    else if (apiVersion eq ApiVersion.V7)
      ElasticsearchParams.V7(indexName)
    else
      throw new IllegalArgumentException("API version " + apiVersion + " is not supported")

  case class VolResult(symbol: String, vol: Double, timestamp: String)

  println("ts : " + ts)

  Source(List(VolResult("test1", 1, ts)))
    .map { message: VolResult =>
      WriteMessage.createIndexMessage(System.currentTimeMillis().toString, message)
    }
    .log("Error")
    .runWith(
      ElasticsearchSink.create[VolResult](
        constructElasticsearchParams("ccy_vol_normalized", "_doc", ApiVersion.V7),
        settings = sinkSettings
      )
    )
}
My date format was incorrect. Using
val HOUR = 1000 * 60 * 60
val utcDate = new Date(System.currentTimeMillis() - HOUR)
val ts = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS").format(utcDate) + "Z"
fixed the issue.
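A side note: SimpleDateFormat formats in the JVM's default time zone, so the appended literal "Z" is only accurate when the JVM runs in UTC (which is likely why the code above subtracts an hour). A minimal java.time sketch that formats explicitly in UTC would be:
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Format the current instant explicitly in UTC; the pattern matches the index
// mapping's yyyy-MM-dd'T'HH:mm:ss.SSS'Z' date format.
val utcFormatter = DateTimeFormatter
  .ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
  .withZone(ZoneOffset.UTC)
val ts = utcFormatter.format(Instant.now())
Alternatively, the mapping's date format can be extended to also accept epoch milliseconds (e.g. "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'||epoch_millis"), in which case the original Long timestamp would have been accepted as-is.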

Related

Scala class member function as UDF

I am trying to define a member function in a class that would be used as a UDF while parsing data from a JSON file. I am using a trait to define a set of methods and a class to override those methods.
trait geouastr {
  def getGeoLocation(ipAddress: String): Map[String, String]
  def uaParser(ua: String): Map[String, String]
}

class GeoUAData(appName: String, sc: SparkContext, conf: SparkConf, combinedCSV: String) extends geouastr with Serializable {

  val spark = SparkSession.builder.config(conf).getOrCreate()
  val GEOIP_FILE_COMBINED = combinedCSV
  val logger = LogFactory.getLog(this.getClass)

  val allDF = spark.
    read.
    option("header", "true").
    option("inferSchema", "true").
    csv(GEOIP_FILE_COMBINED).cache

  val emptyMap = Map(
    "country" -> "",
    "state" -> "",
    "city" -> "",
    "zipCode" -> "",
    "latitude" -> 0.0.toString(),
    "longitude" -> 0.0.toString())

  override def getGeoLocation(ipAddress: String): Map[String, String] = {
    val ipLong = ipToLong(ipAddress)
    try {
      logger.error("Entering UDF " + ipAddress + " allDF " + allDF.count())
      val resultDF = allDF.
        filter(allDF("network").cast("long") <= ipLong.get).
        filter(allDF("broadcast") >= ipLong.get).
        select(allDF("country_name"), allDF("subdivision_1_name"), allDF("city_name"),
          allDF("postal_code"), allDF("latitude"), allDF("longitude"))
      val matchingDF = resultDF.take(1)
      val matchRow = matchingDF(0)
      logger.error("Lookup for " + ipAddress + " Map " + matchRow.toString())
      val geoMap = Map(
        "country" -> nullCheck(matchRow.getAs[String](0)),
        "state" -> nullCheck(matchRow.getAs[String](1)),
        "city" -> nullCheck(matchRow.getAs[String](2)),
        "zipCode" -> nullCheck(matchRow.getAs[String](3)),
        "latitude" -> matchRow.getAs[Double](4).toString(),
        "longitude" -> matchRow.getAs[Double](5).toString())
      geoMap // return the lookup result
    } catch {
      case (nse: NoSuchElementException) => {
        logger.error("No such element", nse)
        emptyMap
      }
      case (npe: NullPointerException) => {
        logger.error("NPE for " + ipAddress + " allDF " + allDF.count(), npe)
        emptyMap
      }
      case (ex: Exception) => {
        logger.error("Generic exception " + ipAddress, ex)
        emptyMap
      }
    }
  }

  def nullCheck(input: String): String = {
    if (input != null) input
    else ""
  }

  override def uaParser(ua: String): Map[String, String] = {
    val client = Parser.get.parse(ua)
    return Map(
      "os" -> client.os.family,
      "device" -> client.device.family,
      "browser" -> client.userAgent.family)
  }

  def ipToLong(ip: String): Option[Long] = {
    Try(ip.split('.').ensuring(_.length == 4)
      .map(_.toLong).ensuring(_.forall(x => x >= 0 && x < 256))
      .zip(Array(256L * 256L * 256L, 256L * 256L, 256L, 1L))
      .map { case (x, y) => x * y }
      .sum).toOption
  }
}
I notice that uaParser works fine, while getGeoLocation returns emptyMap (it runs into an NPE). Below is a snippet showing how I am using this in the main method.
val appName = "SampleApp"
val conf: SparkConf = new SparkConf().setAppName(appName)
val sc: SparkContext = new SparkContext(conf)
val spark = SparkSession.builder.config(conf).enableHiveSupport().getOrCreate()
val geouad = new GeoUAData(appName, sc, conf, args(1))
val uaParser = Sparkudf(geouad.uaParser(_: String))
val geolocation = Sparkudf(geouad.getGeoLocation(_: String))
val sampleRdd = sc.textFile(args(0))
val json = sampleRdd.filter(_.nonEmpty)
import spark.implicits._
val sampleDF = spark.read.json(json)
val columns = sampleDF.select($"user-agent", $"source_ip")
  .withColumn("sourceIp", $"source_ip")
  .withColumn("geolocation", geolocation($"source_ip"))
  .withColumn("uaParsed", uaParser($"user-agent"))
  .withColumn("device", ($"uaParsed")("device"))
  .withColumn("os", ($"uaParsed")("os"))
  .withColumn("browser", ($"uaParsed")("browser"))
  .withColumn("country", ($"geolocation")("country"))
  .withColumn("state", ($"geolocation")("state"))
  .withColumn("city", ($"geolocation")("city"))
  .withColumn("zipCode", ($"geolocation")("zipCode"))
  .withColumn("latitude", ($"geolocation")("latitude"))
  .withColumn("longitude", ($"geolocation")("longitude"))
  .drop("geolocation")
  .drop("uaParsed")
Questions:
1. Should we switch from a class to an object for defining UDFs? (I can keep it as a singleton)
2. Can a class member function be used as a UDF?
3. When such a UDF is invoked, will a class member like allDF remain initialized?
4. A val declared as a member variable - will it get initialized at the time of construction of geouad?
I am new to Scala. Thanks in advance for guidance/suggestions.
No, switching from a class to an object is not necessary for defining a UDF; it only changes how the UDF is called.
Yes, you can use a class member function as a UDF, but first you need to register the function as a UDF:
spark.sqlContext.udf.register("registeredName", classInstance.methodName _)
No; other methods are initialized when calling one UDF.
Yes, a val declared as a class member will be initialized at the time geouad is constructed and some actions are performed.
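For illustration, a minimal sketch of the registration step, reusing geouad and sampleDF from the question (the registered name and the column wiring are illustrative, not taken from the original code):
import org.apache.spark.sql.functions.udf

// SQL-style: register the member method under a name usable from spark.sql(...).
spark.udf.register("geoLocationUdf", geouad.getGeoLocation _)

// DataFrame-style: wrap the member method and apply it to a column.
val geoLocationUdf = udf(geouad.getGeoLocation _)
val withGeo = sampleDF.withColumn("geolocation", geoLocationUdf($"source_ip"))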

Spark collect method taking too much time when processing records stored in RDD[String]

I have a requirement where I have to pull a Parquet file from S3, process it, convert it into another object format, and store it back in S3 in both JSON and Parquet format.
I have made the changes below for this problem statement, but the Spark job is taking too much time when the collect statement is called. Please let me know how this can be optimized. Below is the complete code, which reads a Parquet file from S3, processes it, and stores it to S3. I am very new to Spark and big data technology.
package com.expedia.www.lambda
import java.io._
import com.amazonaws.ClientConfiguration
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing}
import com.expedia.hendrix.lambda.HotelInfosite
import com.expedia.www.hendrix.signals.definition.local.HotelInfoSignal
import com.expedia.www.options.HendrixHistoricalOfflineProcessorOptions
import com.expedia.www.user.interaction.v1.UserInteraction
import com.expedia.www.util._
import com.fasterxml.jackson.core.JsonParser
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.commons.lang.exception.ExceptionUtils
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.slf4j.{Logger, LoggerFactory}
import scala.collection.JavaConverters._
import scala.io.Source
import scala.util.Random
object GenericLambdaMapper{
private def currentTimeMillis: Long = System.currentTimeMillis
/** The below Generic mapper object is built for creating json similar to the Signal pushed by hendrix */
def populateSignalRecord( genericRecord: GenericRecord, uisMessage: UserInteraction, signalType: String): HotelInfoSignal ={
val objectMapper:ObjectMapper = new ObjectMapper
objectMapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
objectMapper.configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER, true)
val hotelInfoObject = objectMapper.readValue( genericRecord.toString, classOf[com.expedia.www.hendrix.signals.definition.local.HotelInfosite])
val userKey = UserKeyUtil.createUserKey(uisMessage)
val hotelInfoSignal:HotelInfoSignal = new HotelInfoSignal
hotelInfoSignal.setSignalType(signalType)
hotelInfoSignal.setData(hotelInfoObject)
hotelInfoSignal.setUserKey(userKey)
hotelInfoSignal.setGeneratedAtTimestamp(currentTimeMillis)
return hotelInfoSignal
}
}
class GenericLambdaMapper extends Serializable{
var LOGGER:Logger = LoggerFactory.getLogger("GenericLambdaMapper")
var bw : BufferedWriter = null
var fw :FileWriter = null
val random: Random = new Random
var counter: Int = 0
var fileName: String= null
val s3Util = new S3Util
/** Object Mapper function for serializing and deserializing objects**/
def objectMapper : ObjectMapper= {
val mapper = new ObjectMapper
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
mapper.configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER, true)
}
def process(sparkContext: SparkContext, options: HendrixHistoricalOfflineProcessorOptions ): Unit = { //ObjectListing
try {
LOGGER.info("Start Date : "+options.startDate)
LOGGER.info("END Date : "+options.endDate)
val listOfFilePath: List[String] = DateTimeUtil.getDateRangeStrFromInput(options.startDate, options.endDate)
/**Looping through each folder based on start and end date **/
listOfFilePath.map(
path => applyLambdaForGivenPathAndPushToS3Signal( sparkContext, path, options )
)
}catch {
case ex: Exception => {
LOGGER.error( "Exception in downloading data :" + options.rawBucketName + options.rawS3UploadRootFolder + options.startDate)
LOGGER.error("Stack Trace :"+ExceptionUtils.getFullStackTrace(ex))
}
}
}
// TODO: Currently the Lambda is hardcoded only to HotelInfoSite to be made generic
def prepareUisObjectAndApplyLambda(uisMessage: UserInteraction, options: HendrixHistoricalOfflineProcessorOptions): List[GenericRecord] = {
try {
val schemaDefinition = Source.fromInputStream(getClass.getResourceAsStream("/"+options.avroSchemaName)).getLines.mkString("\n")
val schemaHotelInfo = new Schema.Parser().parse(schemaDefinition)
HotelInfosite.apply(uisMessage, schemaHotelInfo).toList
}catch {
case ex: Exception => LOGGER.error("Exception while preparing UIS Object" + ex.toString)
List.empty
}
}
/** Below method is used to extract userInteraction Data from Raw file **/
private def constructUisObject(uisMessageRaw: String): UserInteraction = objectMapper.readValue( uisMessageRaw, classOf[UserInteraction])
/** Below function contains logic to apply the lambda for the given range of dates and push to signals folder in S3 **/
def applyLambdaForGivenPathAndPushToS3Signal(sparkContext: SparkContext, dateFolderPath: String, options: HendrixHistoricalOfflineProcessorOptions ): Unit ={
var awsS3Client: AmazonS3Client = null;
try {
if ("sandbox".equals(options.environment)) {
val clientConfiguration = new ClientConfiguration()
.withConnectionTimeout(options.awsConnectionTimeout)
.withSocketTimeout(options.awsSocketTimeout)
.withTcpKeepAlive(true)
awsS3Client = S3Client.getAWSConnection(options.awsS3AccessKey, options.awsS3SecretKey, clientConfiguration)
} else {
awsS3Client = S3Client.getAWSConnection
}
/** Validate if destination path has any gzip file if so then just skip that date and process next record **/
LOGGER.info("Validating if the destination folder path is empty: " + dateFolderPath)
var objectListing: ObjectListing = null
var listObjectsRequest: ListObjectsRequest = new ListObjectsRequest().withBucketName(options.destinationBucketName).withPrefix(options.s3SignalRootFolder + options.signalType + "/" + dateFolderPath.toString)
objectListing = awsS3Client.listObjects(listObjectsRequest)
if (objectListing.getObjectSummaries.size > 0) {
LOGGER.warn("Record already present at the below location, so skipping the processing of record for the folder path :" + dateFolderPath.toString)
LOGGER.warn("s3n://" + options.destinationBucketName + "/" + options.s3SignalRootFolder + options.signalType + "/" + dateFolderPath.toString)
return
}
LOGGER.info("Validated the destination folder path :" + dateFolderPath + " and found it to be empty ")
/** End of validation **/
/*Selecting all the files under the source path and iterating*/
counter = 0
listObjectsRequest = new ListObjectsRequest().withBucketName(options.rawBucketName).withPrefix(options.rawS3UploadRootFolder + dateFolderPath.toString)
objectListing = awsS3Client.listObjects(listObjectsRequest)
val rddListOfParquetFileNames = objectListing.getObjectSummaries.asScala.map(_.getKey).toList
rddListOfParquetFileNames.flatMap{key => { processIndividualParquetFileAndUploadToS3(sparkContext, awsS3Client, options, key, dateFolderPath)
"COMPLETED Processing=>"+key;
}}
}catch{
case ex: Exception =>
LOGGER.error("Exception occured while processing records for the path " + dateFolderPath)
LOGGER.error("Exception in Apply Lambda method Message :" + ex.getMessage + "\n Stack Trace :" + ex.getStackTrace)
}finally {
awsS3Client.shutdown
LOGGER.info("JOB Complete ")
}
}
def processIndividualParquetFileAndUploadToS3(sparkContext: SparkContext, awsS3Client: AmazonS3Client, options: HendrixHistoricalOfflineProcessorOptions, parquetFilePath:String, dateFolderPath:String ):Unit ={
try{
LOGGER.info("Currently Processing the Parquet file: "+parquetFilePath)
LOGGER.info("Starting to reading Parquet File Start Time: "+System.currentTimeMillis)
val dataSetString: RDD[String] = ParquetHelper.readParquetData(sparkContext, options, parquetFilePath)
LOGGER.info("Data Set returned from Parquet file Successful Time: "+System.currentTimeMillis)
val lambdaSignalRecords: Array[HotelInfoSignal] = dataSetString.map(x => constructUisObject(x))
.filter(_ != null)
.map(userInteraction => processIndividualRecords(userInteraction, options))
.filter(_ != null)
.collect
LOGGER.info("Successfully Generated "+lambdaSignalRecords.length+" Signal Records")
if(lambdaSignalRecords.length > 0) {
//Write to Paraquet File :Start
val parquetFileName: String = getFileNameForParquet(dateFolderPath, counter)
val parquetWriter = ParquetHelper.newParquetWriter(HotelInfoSignal.getClassSchema, dateFolderPath, parquetFileName, options)
LOGGER.info("Initialized Parquet Writer")
lambdaSignalRecords.map(signalRecord => parquetWriter.write(signalRecord))
LOGGER.info("Completed writing the data in Parquet format")
parquetWriter.close
//Parquet Write Complete
/*val avroSignalString = lambdaSignalRecords.mkString("\n")
val sparkSession = SparkSession.builder.getOrCreate
uploadProceessedDataToS3(sparkSession, awsS3Client, dateFolderPath, avroSignalString, options)
*/ }
}catch {case ex:Exception =>
LOGGER.error("Skipping processing of record :"+parquetFilePath+" because of Exception: "+ExceptionUtils.getFullStackTrace(ex))
}
LOGGER.info("Completed data processing for file :" + options.rawBucketName + options.rawS3UploadRootFolder + parquetFilePath)
}
def uploadProceessedDataToS3(sparkSession:SparkSession, awsS3Client: AmazonS3Client, filePath: String, genericSignalRecords: String, options: HendrixHistoricalOfflineProcessorOptions):Unit ={
var jsonFile: File = null
var gzFile: File = null
try {
//Building the file name based on the folder accessed
fileName = getFileName (filePath, counter)
jsonFile = IOUtil.createS3JsonFile (genericSignalRecords, fileName)
gzFile = IOUtil.gzipIt (jsonFile)
s3Util.uploadToS3(awsS3Client, options.destinationBucketName, options.s3SignalRootFolder + options.signalType + "/" + filePath, gzFile)
counter += 1 //Incement counter
} catch {
case ex: RuntimeException => LOGGER.error ("Exception while uploading file to path :" + options.s3SignalRootFolder + options.signalType + "/" + filePath + "/" + fileName)
LOGGER.error ("Stack Trace for S3 Upload :" + ExceptionUtils.getFullStackTrace(ex))
} finally {
//Cleaning the temp file created after upload to s3, we can create a temp dir if required.
jsonFile.delete
gzFile.delete
}
}
def processIndividualRecords(userInteraction: UserInteraction, options: HendrixHistoricalOfflineProcessorOptions): HotelInfoSignal ={
try {
//Applying lambda for the indivisual UserInteraction
val list: List[GenericRecord] = prepareUisObjectAndApplyLambda (userInteraction, options)
if (list.nonEmpty) return GenericLambdaMapper.populateSignalRecord (list.head, userInteraction, options.signalType)
} catch { case ex: Exception => LOGGER.error ("Error while creating signal record from UserInteraction for Singal Type :"+ options.signalType +" For Interaction "+userInteraction.toString)
LOGGER.error ("Stack Trace while processIndividualRecords :" + ExceptionUtils.getFullStackTrace(ex))}
null
}
/** This method is used to prepare the exact file name which has processed date and the no of files counter **/
def getFileName(filePath : String, counter : Int): String = {
filePath.replace("/","-")+"_"+counter+"_"+random.alphanumeric.take(5).mkString+".json"
}
/** This method is used to prepare the exact file name which has processed date and the no of files counter **/
def getFileNameForParquet(filePath : String, counter : Int): String = {
filePath.replace("/","-")+"_"+counter+"_"+random.alphanumeric.take(5).mkString+".parquet"
}
}
package com.expedia.www.util
import com.expedia.www.options.HendrixHistoricalOfflineProcessorOptions
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetWriter, AvroSchemaConverter}
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.parquet.hadoop.{ParquetFileWriter, ParquetWriter}
import org.apache.parquet.schema.MessageType
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.slf4j.{Logger, LoggerFactory}
/**
* Created by prasubra on 2/17/17.
*/
object ParquetHelper {
val LOGGER:Logger = LoggerFactory.getLogger("ParquetHelper")
def newParquetWriter(signalSchema: Schema, folderPath:String, fileName:String, options:HendrixHistoricalOfflineProcessorOptions): ParquetWriter[GenericRecord] = {
val blockSize: Int = 256 * 1024 * 1024
val pageSize: Int = 64 * 1024
val compressionCodec = if (options.parquetCompressionToGzip) CompressionCodecName.GZIP else CompressionCodecName.UNCOMPRESSED
val path: Path = new Path("s3n://" + options.destinationBucketName + "/" + options.parquetSignalFolderName + options.signalType + "/" + folderPath + "/" + fileName);
val parquetSchema: MessageType = new AvroSchemaConverter().convert(signalSchema);
// var writeSupport:WriteSupport = new AvroWriteSupport(parquetSchema, signalSchema);
//(path, writeSupport, compressionCodec, blockSize, pageSize)
//var parquetWriter:ParquetWriter[GenericRecord] = new ParquetWriter(path, writeSupport, compressionCodec, blockSize, pageSize);
if ("sandbox".equals(options.environment)) {
val hadoopConf = new Configuration
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", options.awsS3AccessKey)
hadoopConf.set("fs.s3n.awsSecretAccessKey", options.awsS3SecretKey)
hadoopConf.set("fs.s3n.maxRetries", options.awsFileReaderRetry)
AvroParquetWriter.builder(path)
.withSchema(signalSchema)
.withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
.withCompressionCodec(compressionCodec)
.withConf(hadoopConf)
.build()
} else {
AvroParquetWriter.builder(path)
.withSchema(signalSchema)
.withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
.withCompressionCodec(compressionCodec)
.withPageSize(pageSize)
.build()
}
}
def readParquetData(sc: SparkContext, options: HendrixHistoricalOfflineProcessorOptions, filePath: String): RDD[String] = {
val filePathOfParquet = "s3n://"+options.rawBucketName+"/"+ filePath
LOGGER.info("Reading Parquet file from path :"+filePathOfParquet)
val sparkSession = SparkSession.builder.getOrCreate
val dataFrame = sparkSession.sqlContext.read.parquet(filePathOfParquet)
//dataFrame.printSchema()
dataFrame.toJSON.rdd
}
}
First, you really should improve your question with a minimal code example. It's really hard to see what's going on in your code...
collect retrieves all elements of your RDD into a single collection on the driver. If your RDD is large, then this will of course take a lot of time (and maybe cause an OutOfMemoryError if the contents do not fit into the driver's main memory).
You can write the contents of a DataFrame/Dataset directly as Parquet. This will surely be much faster and more scalable.
Use s3a:// URLs. s3n:// has a bug which really kills ORC/Parquet performance, and it has been superseded by s3a now.
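To illustrate the last two points, a minimal sketch that reads and writes entirely through the DataFrame API over s3a:// paths, without collecting to the driver (the bucket names and prefixes are illustrative, not taken from the question):
import org.apache.spark.sql.SparkSession

object DirectWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DirectWriteSketch").getOrCreate()

    // Read the raw Parquet data straight from S3 (illustrative bucket/prefix).
    val raw = spark.read.parquet("s3a://raw-bucket/uis/2017-02-17/")

    // ...apply the per-record transformation here with DataFrame/Dataset operations
    // instead of collect() followed by a driver-side loop...

    // Write the results back to S3 in both formats without bringing them to the driver.
    raw.write.mode("overwrite").parquet("s3a://dest-bucket/signals/parquet/2017-02-17/")
    raw.write.mode("overwrite").json("s3a://dest-bucket/signals/json/2017-02-17/")

    spark.stop()
  }
}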

Scala - Twitter API returns 401 for OAuth

I'm taking my first steps in Scala and trying to implement an application which uses the Twitter streaming API. Below is my code (user tokens are hidden). From the main function, I call the getStreamData function, which calls makeAPIrequest.
package com.myname.myapp
import java.net.URL
import javax.net.ssl.HttpsURLConnection
import java.io.InputStream
import java.io.OutputStream;
import scala.io.Source
import java.net.URLEncoder
import java.util.Base64
import java.nio.charset.StandardCharsets
import scala.collection.immutable.HashMap
import java.util.Calendar
import java.io.Serializable
import scala.collection.immutable.TreeMap
import javax.crypto
import java.security.SecureRandom
import java.math.BigInteger
import scala.util.Random
object TwitterConnector {
private val AUTH_URL: String = "https://api.twitter.com/oauth2/token"
private val CONSUMER_KEY: String = "mykey"
private val CONSUMER_SECRET: String = "mysecret"
private val STREAM_URL: String = "https://stream.twitter.com/1.1/statuses/filter.json"
private var TOKEN: String = "mytoken"
private var TOKEN_SECRET: String = "mytokensecret"
def getStreamData {
val data = "track=" + "twitter"
makeAPIrequest(HTTPmethod("POST"), "https://stream.twitter.com/1.1/statuses/filter.json", None, Option(data))
}
private def makeAPIrequest(method: HTTPmethod, url:String, urlParams:Option[String], data:Option[String]){
//form oauth parameters
val oauth_nonce = Random.alphanumeric.take(32).mkString
val oauth_signature_method: String = "HMAC-SHA1"
val oauth_version: String = "1.0"
val oauth_timestamp = (Calendar.getInstance.getTimeInMillis/1000).toString()
var signatureData = scala.collection.mutable.Map(("oauth_consumer_key", CONSUMER_KEY), ("oauth_token", TOKEN), ("oauth_signature_method", oauth_signature_method), ("oauth_nonce", oauth_nonce), ("oauth_timestamp", oauth_timestamp), ("oauth_version", oauth_version))
//find keys for parameters
val getParams = (parameter: String) => {
val arr = parameter.split("=")
if(arr.length == 1) return
val key = arr(0).asInstanceOf[String]
val value = arr(1).asInstanceOf[String]
signatureData(key) = value
}
val params = urlParams match {
case Some(value) => {
val result = urlParams.get
result.split("&").foreach {getParams}
result
}
case None => ""
}
val postData = data match {
case Some(value) => {
val result = data.get
result.split("&").foreach {getParams}
result
}
case None => ""
}
//url-encode headers data
signatureData.foreach { elem => {
signatureData.remove(elem._1)
signatureData(urlEnc(elem._1)) = urlEnc(elem._2)
}
}
println(signatureData)
//sort headers data
val sortedSignatureData = TreeMap(signatureData.toSeq:_*)
println("Sorted: " + sortedSignatureData)
//form output string
var parameterString = ""
sortedSignatureData.foreach(elem => {
if(parameterString.length() > 0){
parameterString += "&"
}
parameterString += elem._1 + "=" + elem._2
})
val outputString = method.method.toUpperCase() + "&" + urlEnc(url) + "&" + urlEnc(parameterString)
val signingKey = urlEnc(CONSUMER_SECRET) + "&" + urlEnc(TOKEN_SECRET)
println(outputString)
println(signingKey)
val SHA1 = "HmacSHA1";
val key = new crypto.spec.SecretKeySpec(bytes(signingKey), SHA1)
val oauth_signature = {
val mac = crypto.Mac.getInstance(SHA1)
mac.init(key)
new String(base64(mac.doFinal(bytes(outputString)).toString()))
}
println("Signature: " + oauth_signature)
val authHeader: String = "OAuth oauth_consumer_key=\"" + urlEnc(CONSUMER_KEY) + "\", oauth_nonce=\"" + urlEnc(oauth_nonce) + "\", oauth_signature=\"" + urlEnc(oauth_signature) + "\", oauth_signature_method=\"HMAC-SHA1\", oauth_timestamp=\"" + urlEnc(oauth_timestamp) + "\", oauth_token=\"" + urlEnc(TOKEN) + "\", oauth_version=\"1.0\""
println(authHeader)
var text = url
if(params.length > 0){
text += "?"
}
val apiURL: URL = new URL(text + params)
val apiConnection: HttpsURLConnection = apiURL.openConnection.asInstanceOf[HttpsURLConnection]
apiConnection.setRequestMethod(method.method)
apiConnection.setRequestProperty("Authorization", authHeader)
apiConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;charset=UTF-8")
if(method.method == "POST" && postData.length() > 0){
println("POSTING ", postData)
apiConnection.setDoOutput(true)
val outStream: OutputStream = apiConnection.getOutputStream
outStream.write(postData.getBytes())
}
val inStream: InputStream = apiConnection.getInputStream
val serverResponse = Source.fromInputStream(inStream).mkString
println(serverResponse)
}
private def bytes(str: String) = str.getBytes("UTF-8")
private def urlEnc(str: String) = URLEncoder.encode(str, "UTF-8").replace(" ", "%20")
private def base64(str: String) = Base64.getEncoder.encodeToString(str.getBytes(StandardCharsets.UTF_8))
}
Twitter returns a 401 response code.
Obviously I'm doing something wrong. Could you point out where my error is?
I recommend using a better library for making web requests, such as the WS library from the Play Framework. Right now, you're sort of writing Java in Scala. Here's a sample usage of the WS library:
val clientConfig = new DefaultWSClientConfig()
val secureDefaults: com.ning.http.client.AsyncHttpClientConfig = new NingAsyncHttpClientConfigBuilder(clientConfig).build()
val builder = new com.ning.http.client.AsyncHttpClientConfig.Builder(secureDefaults)
builder.setCompressionEnabled(true)
val secureDefaultsWithSpecificOptions: com.ning.http.client.AsyncHttpClientConfig = builder.build()
implicit val implicitClient = new play.api.libs.ws.ning.NingWSClient(secureDefaultsWithSpecificOptions)
val oauthCalc = OAuthCalculator(ConsumerKey(TwitterConfig.consumerKey, TwitterConfig.consumerSecret), RequestToken(TwitterConfig.accessKey, TwitterConfig.accessSecret))
def lookup(ids: List[String]): Future[List[Tweet]] =
  WS.clientUrl(`statuses/lookup`)
    .withQueryString("id" -> ids.mkString(","))
    .sign(oauthCalc)
    .get()
    .map { r =>
      JsonHelper.deserialize[List[Tweet]](r.body)
    }
You should be able to modify this example pretty easily to work with the streaming API.
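A rough sketch of that modification, reusing the implicit NingWSClient and oauthCalc from above (the endpoint and the track parameter come from the question; a real streaming consumer would read the response incrementally rather than as a single body):
import play.api.libs.ws.WS
import scala.concurrent.ExecutionContext.Implicits.global

// POST the form parameters to the filter endpoint, signed with the same OAuth calculator;
// WS builds the OAuth 1.0a Authorization header (nonce, timestamp, signature) for you.
val streamResponse =
  WS.clientUrl("https://stream.twitter.com/1.1/statuses/filter.json")
    .sign(oauthCalc)
    .post(Map("track" -> Seq("twitter")))

streamResponse.map(r => println(r.status + " " + r.statusText))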

Akka Dead Letters with Ask Pattern

I apologize in advance if this seems at all confusing, as I'm dumping quite a bit here. Basically, I have a small service grabbing some JSON, parsing and extracting it to case class(es), then writing it to a database. This service needs to run on a schedule, which is handled well by an Akka scheduler. My database doesn't like it when Slick tries to ask for a new AutoInc id at the same time, so I built in an Await.result to block that from happening. All of this works quite well, but my issue starts here: there are 7 of these services running, so I would like to block each one using a similar Await.result system. Every time I try to send the end time of the request back as a response (at the end of the else block), it gets sent to dead letters instead of to the Distributor. Basically: why does sender ! time go to dead letters and not to the Distributor? This is a long question for a simple problem, but that's how development goes...
ClickActor.scala
import java.text.SimpleDateFormat
import java.util.Date
import Message._
import akka.actor.{Actor, ActorLogging, Props}
import akka.util.Timeout
import com.typesafe.config.ConfigFactory
import net.liftweb.json._
import spray.client.pipelining._
import spray.http.{BasicHttpCredentials, HttpRequest, HttpResponse, Uri}
import akka.pattern.ask
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
case class ClickData(recipient : String, geolocation : Geolocation, tags : Array[String],
url : String, timestamp : Double, campaigns : Array[String],
`user-variables` : JObject, ip : String,
`client-info` : ClientInfo, message : ClickedMessage, event : String)
case class Geolocation(city : String, region : String, country : String)
case class ClientInfo(`client-name`: String, `client-os`: String, `user-agent`: String,
`device-type`: String, `client-type`: String)
case class ClickedMessage(headers : ClickHeaders)
case class ClickHeaders(`message-id` : String)
class ClickActor extends Actor with ActorLogging{
implicit val formats = DefaultFormats
implicit val timeout = new Timeout(3 minutes)
import context.dispatcher
val con = ConfigFactory.load("connection.conf")
val countries = ConfigFactory.load("country.conf")
val regions = ConfigFactory.load("region.conf")
val df = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss -0000")
var time = System.currentTimeMillis()
var begin = new Date(time - (12 hours).toMillis)
var end = new Date(time)
val pipeline : HttpRequest => Future[HttpResponse] = (
addCredentials(BasicHttpCredentials("api", con.getString("mailgun.key")))
~> sendReceive
)
def get(lastrun : Long): Future[String] = {
if(lastrun != 0) {
begin = new Date(lastrun)
end = new Date(time)
}
val uri = Uri(con.getString("mailgun.uri")) withQuery("begin" -> df.format(begin), "end" -> df.format(end),
"ascending" -> "yes", "limit" -> "100", "pretty" -> "yes", "event" -> "clicked")
val request = Get(uri)
val futureResponse = pipeline(request)
return futureResponse.map(_.entity.asString)
}
def receive = {
case lastrun : Long => {
val start = System.currentTimeMillis()
val responseFuture = get(lastrun)
responseFuture.onSuccess {
case payload: String => val json = parse(payload)
//println(pretty(render(json)))
val elements = (json \\ "items").children
if (elements.length == 0) {
log.info("[ClickActor: " + this.hashCode() + "] did not find new events between " +
begin.toString + " and " + end.toString)
sender ! time
context.stop(self)
}
else {
for (item <- elements) {
val data = item.extract[ClickData]
var tags = ""
if (data.tags.length != 0) {
for (tag <- data.tags)
tags += (tag + ", ")
}
var campaigns = ""
if (data.campaigns.length != 0) {
for (campaign <- data.campaigns)
campaigns += (campaign + ", ")
}
val timestamp = (data.timestamp * 1000).toLong
val msg = new ClickMessage(
data.recipient, data.geolocation.city,
regions.getString(data.geolocation.country + "." + data.geolocation.region),
countries.getString(data.geolocation.country), tags, data.url, timestamp,
campaigns, data.ip, data.`client-info`.`client-name`,
data.`client-info`.`client-os`, data.`client-info`.`user-agent`,
data.`client-info`.`device-type`, data.`client-info`.`client-type`,
data.message.headers.`message-id`, data.event, compactRender(item))
val csqla = context.actorOf(Props[ClickSQLActor])
val future = csqla.ask(msg)
val result = Await.result(future, timeout.duration).asInstanceOf[Int]
if (result == 1) {
log.error("[ClickSQLActor: " + csqla.hashCode() + "] shutting down due to lack of system environment variables")
context.stop(csqla)
}
else if(result == 0) {
log.info("[ClickSQLActor: " + csqla.hashCode() + "] successfully wrote to the DB")
}
}
sender ! time
log.info("[ClickActor: " + this.hashCode() + "] processed |" + elements.length + "| new events in " +
(System.currentTimeMillis() - start) + " ms")
}
}
}
}
}
Distributor.scala
import akka.actor.{Props, ActorSystem}
import akka.event.Logging
import akka.util.Timeout
import akka.pattern.ask
import scala.concurrent.duration._
import scala.concurrent.Await
class Distributor {
implicit val timeout = new Timeout(10 minutes)
var lastClick : Long = 0
def distribute(system : ActorSystem) = {
val log = Logging(system, getClass)
val clickFuture = (system.actorOf(Props[ClickActor]) ? lastClick)
lastClick = Await.result(clickFuture, timeout.duration).asInstanceOf[Long]
log.info(lastClick.toString)
//repeat process with other events (open, unsub, etc)
}
}
The reason is that the value of sender (which is a method that retrieves the current sender reference) is no longer valid after leaving the receive block. The future used in the example above will still be running, and by the time it finishes the actor will have left the receive block and bang: an invalid sender results in the message going to the dead letter queue.
The fix is to either not use a future, or, when combining futures, actors and sender, to capture the value of sender before you trigger the future:
val s = sender
val responseFuture = get(lastrun)
responseFuture.onSuccess {
  ....
  s ! time
}
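An alternative that avoids touching sender inside the callback at all is to pipe the future's result back to the captured reference. A minimal sketch (not a drop-in change to the code above; the payload processing is elided):
import akka.pattern.pipe   // provides pipeTo on Future
import context.dispatcher  // ExecutionContext for map/pipeTo

val originalSender = sender()          // capture before the future completes
get(lastrun)
  .map { payload =>
    // ...process the payload as in the receive block...
    time                               // the value to reply with
  }
  .pipeTo(originalSender)              // delivers the result (or failure) to the captured ref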

How to parse JSON with a date exported by mongoexport in Scala?

Given this code example:
import com.mongodb.util.JSON
import com.mongodb.casbah.Imports._
val json = """{"date" : { "$date" : 1327064009959 }}"""
val doc = JSON.parse(json)
I get this error:
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String
What can I do to get this parsed correctly in Scala with Casbah?
There is a solution, which I don't like too much, though:
import com.mongodb.util.JSON
import com.mongodb.casbah.Imports._
import scala.util.matching.Regex

val json = """{"date" : { "$date" : 1327064009959 }}"""
val regex = new Regex("""\{ "\$date" : (\d+) \}""", "date")
val fixed = regex replaceAllIn (json, m => "\"" + (new DateTime(m.group("date").toLong)) + "\"")
val doc = JSON.parse(fixed).asInstanceOf[DBObject]
Check for this typo; this is the valid JSON you should pass:
var json = '{
  "date": {
    "$date": 1327064009959
  }
}';