Saving KMeansModel on HDFS - scala

I want to save a KMeans model on HDFS. To do this I use the save method and create the output directory at runtime (see code). I get the error Exception: metadata already exists. How can I solve this problem?
val lastUrbanKMeansModel = KMeansModel.load(spark, defaultPath + "UrbanRoad/201692918")
val newUrbanKMeansObject = new KMeans()
  .setK(7)
  .setMaxIterations(20)
  .setInitialModel(lastUrbanKMeansModel)
val vectorUrbanRoad = typeStreet.filter(k => k._2 == 1).map(_._1)
if (!vectorUrbanRoad.isEmpty()) {
  val newUrbanModel = newUrbanKMeansObject.run(vectorUrbanRoad)
  newUrbanModel.save(spark, defaultPath + "UrbanRoad/" +
    Calendar.getInstance().get(Calendar.YEAR).toString
    + (Calendar.getInstance().get(Calendar.MONTH) + 1).toString +
    Calendar.getInstance().get(Calendar.DAY_OF_MONTH).toString +
    Calendar.getInstance().get(Calendar.HOUR_OF_DAY).toString)
}
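One possible cause (an assumption, since only the error message is shown) is that save refuses to overwrite an existing model directory, and the hour-granularity timestamp in the path can repeat between runs. A minimal sketch of one workaround, deleting any existing directory through the Hadoop FileSystem API before saving (the timestamp value stands for the Calendar-based string built above):
import org.apache.hadoop.fs.{FileSystem, Path}

val modelPath = defaultPath + "UrbanRoad/" + timestamp  // timestamp: the Calendar-based string built above
val fs = FileSystem.get(spark.hadoopConfiguration)      // spark here is the SparkContext passed to save
if (fs.exists(new Path(modelPath))) {
  fs.delete(new Path(modelPath), true)                  // recursively remove the stale model directory
}
newUrbanModel.save(spark, modelPath)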

Related

Task not serializable in scala

In my application, I'm using the parallelize method to save an Array into a file.
The code is as follows:
val sourceRDD = sc.textFile(inputPath + "/source")
val destinationRDD = sc.textFile(inputPath + "/destination")
val source_primary_key = sourceRDD.map(rec => (rec.split(",")(0).toInt, rec))
val destination_primary_key = destinationRDD.map(rec => (rec.split(",")(0).toInt, rec))
val extra_in_source = source_primary_key.subtractByKey(destination_primary_key)
val extra_in_destination = destination_primary_key.subtractByKey(source_primary_key)
val source_subtract = source_primary_key.subtract(destination_primary_key)
val Destination_subtract = destination_primary_key.subtract(source_primary_key)
val exact_bestmatch_src = source_subtract.subtractByKey(extra_in_source).sortByKey(true).map(rec => (rec._2))
val exact_bestmatch_Dest = Destination_subtract.subtractByKey(extra_in_destination).sortByKey(true).map(rec => (rec._2))
val exact_bestmatch_src_p = exact_bestmatch_src.map(rec => (rec.split(",")(0).toInt))
val primary_key_distinct = exact_bestmatch_src_p.distinct.toArray()
for (i <- primary_key_distinct) {
  var dummyVar: String = ""
  val src = exact_bestmatch_src.filter(line => line.split(",")(0).toInt.equals(i))
  var dest = exact_bestmatch_Dest.filter(line => line.split(",")(0).toInt.equals(i)).toArray
  for (print1 <- src) {
    var sourceArr: Array[String] = print1.split(",")
    var exactbestMatchCounter: Int = 0
    var index: Array[Int] = new Array[Int](1)
    println(print1 + "source")
    for (print2 <- dest) {
      var bestMatchCounter = 0
      var i: Int = 0
      println(print1 + "source + destination" + print2)
      for (i <- 0 until sourceArr.length) {
        if (print1.split(",")(i).equals(print2.split(",")(i))) {
          bestMatchCounter += 1
        }
      }
      if (exactbestMatchCounter < bestMatchCounter) {
        exactbestMatchCounter = bestMatchCounter
        dummyVar = print2
        index +:= exactbestMatchCounter //9,8,9
      }
    }
    var z = index.zipWithIndex.maxBy(_._1)._2
    if (exactbestMatchCounter >= 0) {
      var samparr: Array[String] = new Array[String](4)
      samparr +:= print1 + " BEST_MATCH " + dummyVar
      var deletedest: Array[String] = new Array[String](1)
      deletedest = dest.take(z) ++ dest.drop(1)
      dest = deletedest
      val myFile = sc.parallelize((samparr)).saveAsTextFile(outputPath)
I have used the parallelize method, and I even tried the method below to save it as a file:
val myFile = sc.textFile(samparr.toString())
val finalRdd = myFile
finalRdd.coalesce(1).saveAsTextFile(outputPath)
but it keeps throwing the error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
You can't treat an RDD like a local collection. All operations against it happen over a distributed cluster. For that to work, every function you run against that RDD must be serializable.
The line
for (print1 <- src) {
Here you are iterating over the RDD src, so everything inside the loop must be serializable, as it will be run on the executors.
Inside that loop, however, you call sc.parallelize(...). SparkContext is not serializable. Working with RDDs and the SparkContext is something you do on the driver, and you cannot do it within an RDD operation.
I'm not entirely sure what you are trying to accomplish, but it looks like some sort of hand-coded join between the source and destination. You can't work with loops over RDDs the way you can with local collections. Use APIs such as map, join, groupBy, and the like to create your final RDD, then save that, as in the sketch below.
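For example, something along these lines (a rough illustration only: it assumes both sides are keyed by the first CSV column and does not reproduce the best-match scoring from the question):
val keyedSrc = exact_bestmatch_src.map(line => (line.split(",")(0).toInt, line))
val keyedDest = exact_bestmatch_Dest.map(line => (line.split(",")(0).toInt, line))
// Join on the key and build the output lines on the executors,
// then save the resulting RDD directly instead of collecting and re-parallelizing.
val matched = keyedSrc.join(keyedDest)
  .map { case (_, (srcLine, destLine)) => srcLine + " BEST_MATCH " + destLine }
matched.saveAsTextFile(outputPath)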
If you absolutely feel you must use a foreach loop over the RDD like this, then you can't use sc.parallelize().saveAsTextFile(). Instead, open an output stream using the Hadoop file API and write your array to the file manually.
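A minimal sketch of that alternative, assuming the lines to write (samparr in the question) have first been gathered into a local array on the driver; the file name is illustrative:
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val out = fs.create(new Path(outputPath + "/best_matches.txt"))  // hypothetical file name
try {
  samparr.foreach(line => out.writeBytes(line + "\n"))           // write each line manually
} finally {
  out.close()
}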
Finally, this piece of code helped me save an array to a file:
new PrintWriter(outputPath) { write(array.mkString(" ")); close }

FileUtil.fullyDeleteContents does not work

I've written a script. It is supposed to copy certain types of files from certain directories into other directories (which it should clean beforehand). It copies everything but doesn't clean the directories. There are no errors or exceptions.
import java.io.File

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path, FileUtil}

val conf = new Configuration()
val fs = FileSystem.get(conf)
val listOfFileTypes = List("mta", "rcr", "sub")
val listOfPlatforms = List("B", "C", "H", "M", "Y")

for (fileType <- listOfFileTypes) {
  FileUtil.fullyDeleteContents(new File("/apps/hive/warehouse/soc.db/fct_evkuzmin/file_" + fileType))
  for (platform <- listOfPlatforms) {
    var srcPaths = fs.globStatus(new Path("/user/com/data/" + "20170404" + "_" + platform + "/*" + fileType + ".gz"))
    var dstPath = new Path("/apps/hive/warehouse/soc.db/fair_usage/fct_evkuzmin/file_" + fileType)
    for (srcPath <- srcPaths) {
      println("copying " + srcPath.getPath.toString)
      FileUtil.copy(fs, srcPath.getPath, fs, dstPath, false, conf)
    }
  }
}
Why doesn't FileUtil.fullyDeleteContents work?
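A likely explanation (an assumption, since no accepted answer is included here): FileUtil.fullyDeleteContents takes a java.io.File, so it operates on the local filesystem of the machine running the script, not on HDFS, and the HDFS directories are never touched. A sketch of clearing the directory through the FileSystem handle instead, reusing fs and fileType from the script above:
val dirToClean = new Path("/apps/hive/warehouse/soc.db/fct_evkuzmin/file_" + fileType)
if (fs.exists(dirToClean)) {
  // Delete every entry under the directory while keeping the directory itself.
  fs.listStatus(dirToClean).foreach(status => fs.delete(status.getPath, true))
}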

spark collect method taking too much time when processing records stored in RDD[String]

I have a requirement where I have to pull a Parquet file from S3, process it, convert it into another object format, and store it back in S3 in JSON and Parquet format.
I have made the changes below for this problem statement, but the Spark job takes too much time when the collect statement is called. Please let me know how this can be optimized. Below is the complete code, which reads the Parquet file from S3, processes it, and stores it back to S3. I am very new to Spark and Big Data technology.
package com.expedia.www.lambda
import java.io._
import com.amazonaws.ClientConfiguration
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing}
import com.expedia.hendrix.lambda.HotelInfosite
import com.expedia.www.hendrix.signals.definition.local.HotelInfoSignal
import com.expedia.www.options.HendrixHistoricalOfflineProcessorOptions
import com.expedia.www.user.interaction.v1.UserInteraction
import com.expedia.www.util._
import com.fasterxml.jackson.core.JsonParser
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.commons.lang.exception.ExceptionUtils
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.slf4j.{Logger, LoggerFactory}
import scala.collection.JavaConverters._
import scala.io.Source
import scala.util.Random
object GenericLambdaMapper{
private def currentTimeMillis: Long = System.currentTimeMillis
/** The below Generic mapper object is built for creating json similar to the Signal pushed by hendrix */
def populateSignalRecord( genericRecord: GenericRecord, uisMessage: UserInteraction, signalType: String): HotelInfoSignal ={
val objectMapper:ObjectMapper = new ObjectMapper
objectMapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
objectMapper.configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER, true)
val hotelInfoObject = objectMapper.readValue( genericRecord.toString, classOf[com.expedia.www.hendrix.signals.definition.local.HotelInfosite])
val userKey = UserKeyUtil.createUserKey(uisMessage)
val hotelInfoSignal:HotelInfoSignal = new HotelInfoSignal
hotelInfoSignal.setSignalType(signalType)
hotelInfoSignal.setData(hotelInfoObject)
hotelInfoSignal.setUserKey(userKey)
hotelInfoSignal.setGeneratedAtTimestamp(currentTimeMillis)
return hotelInfoSignal
}
}
class GenericLambdaMapper extends Serializable{
var LOGGER:Logger = LoggerFactory.getLogger("GenericLambdaMapper")
var bw : BufferedWriter = null
var fw :FileWriter = null
val random: Random = new Random
var counter: Int = 0
var fileName: String= null
val s3Util = new S3Util
/** Object Mapper function for serializing and deserializing objects**/
def objectMapper : ObjectMapper= {
val mapper = new ObjectMapper
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
mapper.configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER, true)
}
def process(sparkContext: SparkContext, options: HendrixHistoricalOfflineProcessorOptions ): Unit = { //ObjectListing
try {
LOGGER.info("Start Date : "+options.startDate)
LOGGER.info("END Date : "+options.endDate)
val listOfFilePath: List[String] = DateTimeUtil.getDateRangeStrFromInput(options.startDate, options.endDate)
/**Looping through each folder based on start and end date **/
listOfFilePath.map(
path => applyLambdaForGivenPathAndPushToS3Signal( sparkContext, path, options )
)
}catch {
case ex: Exception => {
LOGGER.error( "Exception in downloading data :" + options.rawBucketName + options.rawS3UploadRootFolder + options.startDate)
LOGGER.error("Stack Trace :"+ExceptionUtils.getFullStackTrace(ex))
}
}
}
// TODO: Currently the Lambda is hardcoded only to HotelInfoSite to be made generic
def prepareUisObjectAndApplyLambda(uisMessage: UserInteraction, options: HendrixHistoricalOfflineProcessorOptions): List[GenericRecord] = {
try {
val schemaDefinition = Source.fromInputStream(getClass.getResourceAsStream("/"+options.avroSchemaName)).getLines.mkString("\n")
val schemaHotelInfo = new Schema.Parser().parse(schemaDefinition)
HotelInfosite.apply(uisMessage, schemaHotelInfo).toList
}catch {
case ex: Exception => LOGGER.error("Exception while preparing UIS Object" + ex.toString)
List.empty
}
}
/** Below method is used to extract userInteraction Data from Raw file **/
private def constructUisObject(uisMessageRaw: String): UserInteraction = objectMapper.readValue( uisMessageRaw, classOf[UserInteraction])
/** Below function contains logic to apply the lambda for the given range of dates and push to signals folder in S3 **/
def applyLambdaForGivenPathAndPushToS3Signal(sparkContext: SparkContext, dateFolderPath: String, options: HendrixHistoricalOfflineProcessorOptions ): Unit ={
var awsS3Client: AmazonS3Client = null;
try {
if ("sandbox".equals(options.environment)) {
val clientConfiguration = new ClientConfiguration()
.withConnectionTimeout(options.awsConnectionTimeout)
.withSocketTimeout(options.awsSocketTimeout)
.withTcpKeepAlive(true)
awsS3Client = S3Client.getAWSConnection(options.awsS3AccessKey, options.awsS3SecretKey, clientConfiguration)
} else {
awsS3Client = S3Client.getAWSConnection
}
/** Validate if destination path has any gzip file if so then just skip that date and process next record **/
LOGGER.info("Validating if the destination folder path is empty: " + dateFolderPath)
var objectListing: ObjectListing = null
var listObjectsRequest: ListObjectsRequest = new ListObjectsRequest().withBucketName(options.destinationBucketName).withPrefix(options.s3SignalRootFolder + options.signalType + "/" + dateFolderPath.toString)
objectListing = awsS3Client.listObjects(listObjectsRequest)
if (objectListing.getObjectSummaries.size > 0) {
LOGGER.warn("Record already present at the below location, so skipping the processing of record for the folder path :" + dateFolderPath.toString)
LOGGER.warn("s3n://" + options.destinationBucketName + "/" + options.s3SignalRootFolder + options.signalType + "/" + dateFolderPath.toString)
return
}
LOGGER.info("Validated the destination folder path :" + dateFolderPath + " and found it to be empty ")
/** End of validation **/
/*Selecting all the files under the source path and iterating*/
counter = 0
listObjectsRequest = new ListObjectsRequest().withBucketName(options.rawBucketName).withPrefix(options.rawS3UploadRootFolder + dateFolderPath.toString)
objectListing = awsS3Client.listObjects(listObjectsRequest)
val rddListOfParquetFileNames = objectListing.getObjectSummaries.asScala.map(_.getKey).toList
rddListOfParquetFileNames.flatMap{key => { processIndividualParquetFileAndUploadToS3(sparkContext, awsS3Client, options, key, dateFolderPath)
"COMPLETED Processing=>"+key;
}}
}catch{
case ex: Exception =>
LOGGER.error("Exception occured while processing records for the path " + dateFolderPath)
LOGGER.error("Exception in Apply Lambda method Message :" + ex.getMessage + "\n Stack Trace :" + ex.getStackTrace)
}finally {
awsS3Client.shutdown
LOGGER.info("JOB Complete ")
}
}
def processIndividualParquetFileAndUploadToS3(sparkContext: SparkContext, awsS3Client: AmazonS3Client, options: HendrixHistoricalOfflineProcessorOptions, parquetFilePath:String, dateFolderPath:String ):Unit ={
try{
LOGGER.info("Currently Processing the Parquet file: "+parquetFilePath)
LOGGER.info("Starting to reading Parquet File Start Time: "+System.currentTimeMillis)
val dataSetString: RDD[String] = ParquetHelper.readParquetData(sparkContext, options, parquetFilePath)
LOGGER.info("Data Set returned from Parquet file Successful Time: "+System.currentTimeMillis)
val lambdaSignalRecords: Array[HotelInfoSignal] = dataSetString.map(x => constructUisObject(x))
.filter(_ != null)
.map(userInteraction => processIndividualRecords(userInteraction, options))
.filter(_ != null)
.collect
LOGGER.info("Successfully Generated "+lambdaSignalRecords.length+" Signal Records")
if(lambdaSignalRecords.length > 0) {
//Write to Paraquet File :Start
val parquetFileName: String = getFileNameForParquet(dateFolderPath, counter)
val parquetWriter = ParquetHelper.newParquetWriter(HotelInfoSignal.getClassSchema, dateFolderPath, parquetFileName, options)
LOGGER.info("Initialized Parquet Writer")
lambdaSignalRecords.map(signalRecord => parquetWriter.write(signalRecord))
LOGGER.info("Completed writing the data in Parquet format")
parquetWriter.close
//Parquet Write Complete
/*val avroSignalString = lambdaSignalRecords.mkString("\n")
val sparkSession = SparkSession.builder.getOrCreate
uploadProceessedDataToS3(sparkSession, awsS3Client, dateFolderPath, avroSignalString, options)
*/ }
}catch {case ex:Exception =>
LOGGER.error("Skipping processing of record :"+parquetFilePath+" because of Exception: "+ExceptionUtils.getFullStackTrace(ex))
}
LOGGER.info("Completed data processing for file :" + options.rawBucketName + options.rawS3UploadRootFolder + parquetFilePath)
}
def uploadProceessedDataToS3(sparkSession:SparkSession, awsS3Client: AmazonS3Client, filePath: String, genericSignalRecords: String, options: HendrixHistoricalOfflineProcessorOptions):Unit ={
var jsonFile: File = null
var gzFile: File = null
try {
//Building the file name based on the folder accessed
fileName = getFileName (filePath, counter)
jsonFile = IOUtil.createS3JsonFile (genericSignalRecords, fileName)
gzFile = IOUtil.gzipIt (jsonFile)
s3Util.uploadToS3(awsS3Client, options.destinationBucketName, options.s3SignalRootFolder + options.signalType + "/" + filePath, gzFile)
counter += 1 //Increment counter
} catch {
case ex: RuntimeException => LOGGER.error ("Exception while uploading file to path :" + options.s3SignalRootFolder + options.signalType + "/" + filePath + "/" + fileName)
LOGGER.error ("Stack Trace for S3 Upload :" + ExceptionUtils.getFullStackTrace(ex))
} finally {
//Cleaning the temp file created after upload to s3, we can create a temp dir if required.
jsonFile.delete
gzFile.delete
}
}
def processIndividualRecords(userInteraction: UserInteraction, options: HendrixHistoricalOfflineProcessorOptions): HotelInfoSignal ={
try {
//Applying lambda for the individual UserInteraction
val list: List[GenericRecord] = prepareUisObjectAndApplyLambda (userInteraction, options)
if (list.nonEmpty) return GenericLambdaMapper.populateSignalRecord (list.head, userInteraction, options.signalType)
} catch { case ex: Exception => LOGGER.error ("Error while creating signal record from UserInteraction for Singal Type :"+ options.signalType +" For Interaction "+userInteraction.toString)
LOGGER.error ("Stack Trace while processIndividualRecords :" + ExceptionUtils.getFullStackTrace(ex))}
null
}
/** This method is used to prepare the exact file name which has processed date and the no of files counter **/
def getFileName(filePath : String, counter : Int): String = {
filePath.replace("/","-")+"_"+counter+"_"+random.alphanumeric.take(5).mkString+".json"
}
/** This method is used to prepare the exact file name which has processed date and the no of files counter **/
def getFileNameForParquet(filePath : String, counter : Int): String = {
filePath.replace("/","-")+"_"+counter+"_"+random.alphanumeric.take(5).mkString+".parquet"
}
}
package com.expedia.www.util
import com.expedia.www.options.HendrixHistoricalOfflineProcessorOptions
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetWriter, AvroSchemaConverter}
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.parquet.hadoop.{ParquetFileWriter, ParquetWriter}
import org.apache.parquet.schema.MessageType
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.slf4j.{Logger, LoggerFactory}
/**
* Created by prasubra on 2/17/17.
*/
object ParquetHelper {
val LOGGER:Logger = LoggerFactory.getLogger("ParquetHelper")
def newParquetWriter(signalSchema: Schema, folderPath:String, fileName:String, options:HendrixHistoricalOfflineProcessorOptions): ParquetWriter[GenericRecord] = {
val blockSize: Int = 256 * 1024 * 1024
val pageSize: Int = 64 * 1024
val compressionCodec = if (options.parquetCompressionToGzip) CompressionCodecName.GZIP else CompressionCodecName.UNCOMPRESSED
val path: Path = new Path("s3n://" + options.destinationBucketName + "/" + options.parquetSignalFolderName + options.signalType + "/" + folderPath + "/" + fileName);
val parquetSchema: MessageType = new AvroSchemaConverter().convert(signalSchema);
// var writeSupport:WriteSupport = new AvroWriteSupport(parquetSchema, signalSchema);
//(path, writeSupport, compressionCodec, blockSize, pageSize)
//var parquetWriter:ParquetWriter[GenericRecord] = new ParquetWriter(path, writeSupport, compressionCodec, blockSize, pageSize);
if ("sandbox".equals(options.environment)) {
val hadoopConf = new Configuration
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", options.awsS3AccessKey)
hadoopConf.set("fs.s3n.awsSecretAccessKey", options.awsS3SecretKey)
hadoopConf.set("fs.s3n.maxRetries", options.awsFileReaderRetry)
AvroParquetWriter.builder(path)
.withSchema(signalSchema)
.withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
.withCompressionCodec(compressionCodec)
.withConf(hadoopConf)
.build()
} else {
AvroParquetWriter.builder(path)
.withSchema(signalSchema)
.withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
.withCompressionCodec(compressionCodec)
.withPageSize(pageSize)
.build()
}
}
def readParquetData(sc: SparkContext, options: HendrixHistoricalOfflineProcessorOptions, filePath: String): RDD[String] = {
val filePathOfParquet = "s3n://"+options.rawBucketName+"/"+ filePath
LOGGER.info("Reading Parquet file from path :"+filePathOfParquet)
val sparkSession = SparkSession.builder.getOrCreate
val dataFrame = sparkSession.sqlContext.read.parquet(filePathOfParquet)
//dataFrame.printSchema()
dataFrame.toJSON.rdd
}
}
First, you really should improve your question with a minimal code example. It's really hard to see what's going on in your code.
collect retrieves all elements of your RDD into a local collection on the driver. If your RDD is large, this will of course take a lot of time (and may cause an OutOfMemoryError if the content does not fit into the driver's main memory).
You can write the content of a DataFrame/Dataset directly as Parquet. This will surely be much faster and more scalable.
Use s3a:// URLs. s3n:// has a bug which really kills ORC/Parquet performance, and it has been superseded by s3a now.
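A minimal sketch of both suggestions (bucket names and prefixes are illustrative, not taken from the question): skip collect and let Spark write the transformed data out itself, over s3a://.
val sparkSession = SparkSession.builder.getOrCreate()
// Read the raw Parquet data and transform it with DataFrame/Dataset operations only...
val raw = sparkSession.read.parquet("s3a://raw-bucket/some/prefix/")
val asJsonLines = raw.toJSON  // Dataset[String], one JSON document per record
// ...then write the results out directly, without collecting to the driver.
asJsonLines.write.mode("overwrite").text("s3a://dest-bucket/signals/json/")
raw.write.mode("overwrite").parquet("s3a://dest-bucket/signals/parquet/")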

How to use UPDATE command in spark-sql & DataFrames

I am trying to implement the UPDATE command on DataFrames in Spark, but I am getting this error. Please suggest what should be done.
17/01/19 11:49:39 INFO Replace$: query --> UPDATE temp SET c2 = REPLACE(c2,"i","a");
17/01/19 11:49:39 ERROR Main$: [1.1] failure: ``with'' expected but identifier UPDATE found
UPDATE temp SET c2 = REPLACE(c2,"i","a");
^
java.lang.RuntimeException: [1.1] failure: ``with'' expected but identifier UPDATE found
UPDATE temp SET c2 = REPLACE(c2,"i","a");
This is the program:
object Replace extends SparkPipelineJob {
  val logger = LoggerFactory.getLogger(getClass)
  protected implicit val jsonFormats: Formats = DefaultFormats

  def createSetCondition(colTypeMap: List[(String, DataType)], pattern: String, replacement: String): String = {
    val res = colTypeMap map {
      case (c, t) =>
        if (t == StringType)
          c + " = REPLACE(" + c + ",\"" + pattern + "\",\"" + replacement + "\")"
        else
          c + " = REPLACE(" + c + "," + pattern + "," + replacement + ")"
    }
    return res.mkString(" , ")
  }

  override def execute(dataFrames: List[DataFrame], sc: SparkContext, sqlContext: SQLContext, params: String, productId: Int): List[DataFrame] = {
    import sqlContext.implicits._
    val replaceData = ((parse(params)).extractOpt[ReplaceDataSchema]).get
    logger.info(s"Replace-replaceData --> ${replaceData}")
    val (inputDf, (columnsMap, colTypeMap)) = (dataFrames(0), LoadInput.colMaps(dataFrames(0)))
    val tableName = Constants.TEMP_TABLE
    inputDf.registerTempTable(tableName)
    val colMap = replaceData.colName map {
      x => (x, colTypeMap.get(x).get)
    }
    logger.info(s"colMap --> ${colMap}")
    val setCondition = createSetCondition(colMap, replaceData.input, replaceData.output)
    val query = "UPDATE " + tableName + " SET " + setCondition + ";"
    logger.info(s"query --> ${query}")
    val outputDf = sqlContext.sql(query)
    List(outputDf)
  }
}
Here is some extra information.
17/01/19 11:49:39 INFO Replace$: Replace-replaceData --> ReplaceDataSchema(List(SchemaDetectData(s3n://fakepath/data37.csv,None,None)),List(c2),i,a)
17/01/19 11:49:39 INFO Replace$: colMap --> List((c2,StringType))
data37.csv
c1 c2
90 nine
Please ask for extra information if needed.
Spark SQL doesn't support UPDATE queries. If you want to "modify" the data you should create a new table with SELECT:
SELECT * REPLACE(c2, 'i', 'a') AS c2 FROM table
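An alternative sketch using the DataFrame API rather than SQL (inputDf and tableName come from the program above; regexp_replace is used here on the assumption that a plain character replacement is what is wanted):
import org.apache.spark.sql.functions.{col, regexp_replace}

// Build a new DataFrame in which c2 has "i" replaced by "a"; other columns pass through unchanged.
val outputDf = inputDf.withColumn("c2", regexp_replace(col("c2"), "i", "a"))
outputDf.registerTempTable(tableName)  // re-register if downstream code expects the temp table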

SignatureDoesNotMatch Aws CloudSearch scala

I keep getting:
"#SignatureDoesNotMatch","error":{"message":"[Deprecated: Use the
outer message field] The request signature we calculated does not
match the signature you provided. Check your AWS Secret Access Key and
signing method. Consult the service documentation for details.
from trying to do a GET request to CloudSearch. I verified that my canonical string and string-to-sign match the ones sent back in the error message every time now, but I keep getting the error. I'm assuming my signature itself isn't being computed correctly, but it's hard to nail down.
def getHash(key: Array[Byte]): String = {
  try {
    val md = MessageDigest.getInstance("SHA-256").digest(key)
    md.map("%02x".format(_)).mkString.toLowerCase()
  } catch {
    case e: Exception => ""
  }
}
.
def HmacSHA256(data: String, key: Array[Byte]): Array[Byte] = {
  val algorithm = "HmacSHA256"
  val mac = Mac.getInstance(algorithm)
  mac.init(new SecretKeySpec(key, algorithm))
  mac.doFinal(data.getBytes("UTF8"))
}
.
...
val algorithm = "AWS4-HMAC-SHA256"
val credential_scope = date + "/us-west-1/cloudsearch/aws4_request"
val string_to_sign = algorithm + "\n" + dateTime + "\n" + credential_scope + "\n" + getHash(canonical_request)
val kSecret = ("AWS4" + config.getString("cloud.secret")).getBytes("utf-8")
val kDate = HmacSHA256(date.toString, kSecret)
val kRegion = HmacSHA256("us-west-1",kDate)
val kService = HmacSHA256("cloudsearch",kRegion)
val kSigning = HmacSHA256("aws4_request",kService)
val signing_key = kSigning
val signature = getHash(HmacSHA256(string_to_sign, kSigning))
val authorization_header = algorithm + " " + "Credential=" + config.getString("cloud.key") + "/" + credential_scope + ", " + "SignedHeaders=" + signed_headers + ", " + "Signature=" + signature
val complexHolder = holder.withHeaders(("x-amz-date",dateTime.toString))
.withHeaders(("Authorization",authorization_header))
.withRequestTimeout(5000)
.get()
val response = Await.result(complexHolder, 10 second)
I just released a helper library to sign your HTTP requests to AWS: https://github.com/ticofab/aws-request-signer . Hope it helps!