Data download in parallel with Spark - Scala

I have a URL for a file that is encrypted with "AES/CBC/PKCS5Padding", and I have the cipher key and the IV string. I am using an Amazon EC2 instance to download the data and store it in S3. What I currently do is create an input stream using HttpClient and GetMethod, wrap that stream with javax.crypto, and finally put the resulting stream to an S3 location. This works for one file, but I need to scale up and do the same thing for many such URLs. How can I make use of parallel downloads?
The download time for, say, n files is the same whether I use 1 node or 4 nodes (1 master and 3 slaves) if I don't parallelize the array of URLs, and if I use .par it throws an out-of-memory error. Here is the code for the download and upload (the function downloadAndProcessDataFunc):
val client = new HttpClient()
val s3Client = AmazonS3ClientBuilder.standard().withRegion(Regions.EU_WEST_1).build()
val method = new GetMethod(url)
val metadata = new ObjectMetadata()
val responseCode = client.executeMethod(method);
println("FROM SCRIPT -> ", destFile, responseCode)
val istream = method.getResponseBodyAsStream();
val base64 = new Base64();
val key = base64.decode(cipherKey.getBytes());
val ivKey = base64.decode(iv.getBytes());
val secretKey = new SecretKeySpec( key, "AES" );
val ivSpec = new IvParameterSpec(ivKey);
val cipher = Cipher.getInstance( algorithm )
cipher.init( Cipher.DECRYPT_MODE, secretKey, ivSpec );
val cistream = new CipherInputStream( istream, cipher );
s3Client.putObject(<bucket>, destFile, cistream, metadata)
This is the call I make, but it isn't parallelized:
manifestArray.foreach { row => downloadAndProcessDataFunc(row.split(",")(0), row.split(",")(1), row.split(",")(2), row.split(",")(3), row.split(",")(4), row.split(",")(5).toLong) }
This runs out of memory:
manifest.par.foreach { row => downloadAndProcessDataFunc(row.mkString(",").split(",")(0), row.mkString(",").split(",")(1), row.mkString(",").split(",")(2), row.mkString(",").split(",")(3), element.replace("s3://testbucketforemrupload/manifest/", "data/") + "/" + row.mkString(",").split(",")(4).split("/").last) }
Now if I change the downloadAndProcessDataFunc function to the following, it downloads only around 30 of the 1536 URLs and kills the rest of the executors with out-of-memory errors:
val client = new HttpClient()
val s3Client = AmazonS3ClientBuilder.standard().withRegion(Regions.EU_WEST_1).build()
val method = new GetMethod(url)
val metadata = new ObjectMetadata()
val responseCode = client.executeMethod(method);
println("FROM SCRIPT -> ", destFile, responseCode)
val istream = method.getResponseBodyAsStream();
val base64 = new Base64();
val key = base64.decode(cipherKey.getBytes());
val ivKey = base64.decode(iv.getBytes());
val secretKey = new SecretKeySpec( key, "AES" );
val ivSpec = new IvParameterSpec(ivKey);
val cipher = Cipher.getInstance( algorithm )
cipher.init( Cipher.DECRYPT_MODE, secretKey, ivSpec );
val cistream = new CipherInputStream( istream, cipher );
metadata.setContentLength(sizeInBytes);
s3Client.putObject(<bucket>, destFile, cistream, metadata)
The variables are self-explanatory.
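One approach that might help: instead of looping over the manifest on the driver (or using .par, which runs all downloads concurrently on the driver and can exhaust its memory), distribute the URL list as an RDD so each executor processes a slice of it; the number of partitions then bounds how many downloads run at once. A minimal sketch, assuming a SparkContext named sc, the same downloadAndProcessDataFunc, and the argument order shown in the foreach call above (the partition count is a placeholder to tune for the cluster):

// distribute the manifest rows; each partition is handled by one task on an executor
val numPartitions = 12 // assumption: tune to the number of executors and their memory
sc.parallelize(manifestArray, numPartitions).foreach { row =>
  val cols = row.split(",")
  // assumed order from the call above: url, destFile, cipherKey, iv, algorithm, sizeInBytes
  downloadAndProcessDataFunc(cols(0), cols(1), cols(2), cols(3), cols(4), cols(5).toLong)
}

Note that downloadAndProcessDataFunc and everything it captures must be serializable (e.g. defined in an object) for Spark to ship it to the executors.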

Related

dbutils.secrets.get - NoSuchElementException: None.get

The code below executes a 'get' API method to retrieve objects from S3 and write them to the data lake.
The problem arises when I use dbutils.secrets.get to fetch the keys required to establish the connection to S3:
my_dataframe.rdd.foreachPartition(partition => {
  val AccessKey = dbutils.secrets.get(scope = "ADB_Scope", key = "AccessKey-ID")
  val SecretKey = dbutils.secrets.get(scope = "ADB_Scope", key = "AccessKey-Secret")
  val creds = new BasicAWSCredentials(AccessKey, SecretKey)
  val clientRegion: Regions = Regions.US_EAST_1
  val s3client = AmazonS3ClientBuilder.standard()
    .withRegion(clientRegion)
    .withCredentials(new AWSStaticCredentialsProvider(creds))
    .build()
  partition.foreach(x => {
    val objectKey = x.getString(0)
    val i = s3client.getObject(s3bucketName, objectKey).getObjectContent
    val inputS3String = IOUtils.toString(i, "UTF-8")
    val filePath = s"${data_lake_get_path}"
    val file = new File(filePath)
    val fileWriter = new FileWriter(file)
    val bw = new BufferedWriter(fileWriter)
    bw.write(inputS3String)
    bw.close()
    fileWriter.close()
  })
})
The above results in the error:-
Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at com.databricks.dbutils_v1.impl.SecretUtilsImpl.sc$lzycompute(SecretUtilsImpl.scala:24)
at com.databricks.dbutils_v1.impl.SecretUtilsImpl.sc(SecretUtilsImpl.scala:24)
at com.databricks.dbutils_v1.impl.SecretUtilsImpl.getSecretManagerClient(SecretUtilsImpl.scala:36)
at com.databricks.dbutils_v1.impl.SecretUtilsImpl.getBytesInternal(SecretUtilsImpl.scala:46)
at com.databricks.dbutils_v1.impl.SecretUtilsImpl.get(SecretUtilsImpl.scala:61)
When the actual secret scope values for AccessKey and SecretKey are passed the above code works fine.
How can this work using dbutils.secrets.get so that keys are not exposed in the code?
You just need to move the following two lines:
val AccessKey = dbutils.secrets.get(scope = "ADB_Scope", key = "AccessKey-ID")
val SecretKey = dbutils.secrets.get(scope = "ADB_Scope", key = "AccessKey-Secret")
outside of the foreachPartition block, so that these calls are executed in the context of the driver and only the resulting values are sent to the worker nodes.
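A sketch of that change, using the same names as in the question (s3bucketName and data_lake_get_path are assumed to be defined elsewhere):

// resolve the secrets on the driver
val AccessKey = dbutils.secrets.get(scope = "ADB_Scope", key = "AccessKey-ID")
val SecretKey = dbutils.secrets.get(scope = "ADB_Scope", key = "AccessKey-Secret")

my_dataframe.rdd.foreachPartition(partition => {
  // only the captured string values are serialized to the executors
  val creds = new BasicAWSCredentials(AccessKey, SecretKey)
  val s3client = AmazonS3ClientBuilder.standard()
    .withRegion(Regions.US_EAST_1)
    .withCredentials(new AWSStaticCredentialsProvider(creds))
    .build()
  partition.foreach(x => {
    // ... same per-object getObject and write logic as in the question ...
  })
})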

Spark Error:- "value foreach is not a member of Object"

The dataframe consists of two columns (s3ObjectName, batchName) with tens of thousands of rows like:
s3ObjectName    batchName
a1.json         45
b2.json         45
c3.json         45
d4.json         46
e5.json         46
The objective is to retrieve objects from an S3 bucket and write them to the data lake in parallel, using the details from each row of the dataframe with the foreachPartition() and foreach() functions.
// s3 connector details defined as an object so it can be serialized and available on all executors in the cluster
object container {
  def getDataSource() = {
    val AccessKey = dbutils.secrets.get(scope = "ADBTEL_Scope", key = "Telematics-TrueMotion-AccessKey-ID")
    val SecretKey = dbutils.secrets.get(scope = "ADBTEL_Scope", key = "Telematics-TrueMotion-AccessKey-Secret")
    val creds = new BasicAWSCredentials(AccessKey, SecretKey)
    val clientRegion: Regions = Regions.US_EAST_1
    AmazonS3ClientBuilder.standard()
      .withRegion(clientRegion)
      .withCredentials(new AWSStaticCredentialsProvider(creds))
      .build()
  }
}
dataframe.foreachPartition(partition => {
  // Initialize s3 connection for each partition
  val client: AmazonS3 = container.getDataSource()
  partition.foreach(row => {
    val s3ObjectName = row.getString(0)
    val batchname = row.getString(1)
    val inputS3Stream = client.getObject("s3bucketname", s3ObjectName).getObjectContent
    val inputS3String = IOUtils.toString(inputS3Stream, "UTF-8")
    val filePath = s"/dbfs/mnt/test/${batchname}/${s3ObjectName}"
    val file = new File(filePath)
    val fileWriter = new FileWriter(file)
    val bw = new BufferedWriter(fileWriter)
    bw.write(inputS3String)
    bw.close()
    fileWriter.close()
  })
})
The above process gives me
Error: value foreach is not a member of Object
Convert the DataFrame to an RDD before calling foreachPartition:
dataframe.rdd.foreachPartition(partition => {
})
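On the RDD the lambda parameter is an Iterator[Row], so partition.foreach compiles; on the Dataset the overloaded foreachPartition signatures apparently leave the parameter typed as Object, which is what the error message reports. A sketch with the rest of the code unchanged:

dataframe.rdd.foreachPartition(partition => {
  // partition is an Iterator[Row] here, so foreach resolves as expected
  val client: AmazonS3 = container.getDataSource()
  partition.foreach(row => {
    val s3ObjectName = row.getString(0)
    val batchname = row.getString(1)
    // ... same getObject and write-to-/dbfs/mnt/test logic as in the question ...
  })
})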

Fail to write to S3

I am trying to write a file to Amazon S3.
val creds = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY)
val amazonS3Client = new AmazonS3Client(creds)
val filePath = "/service2/2019/06/30/21"
val fileContent = "{\"key\":\"value\"}"
val meta = new ObjectMetadata();
amazonS3Client.putObject(bucketName, filePath, new ByteArrayInputStream(fileContent.getBytes), meta)
The program finishes with no error, but no file is written to the bucket.
The key argument seems to have a typo. Try it without the initial forward slash:
val filePath = "service2/2019/06/30/21"
instead of
val filePath = "/service2/2019/06/30/21"

Scala cipher AES decryption does not work -- conditionally unable to decrypt file -- padding error

Scala 2.11.8, help with encryption... This follows principles from this Stack Overflow site and gives an error (javax.crypto.BadPaddingException: Given final block not properly padded). I know what causes the error, but I need help handling it. Note: the error occurs when the decrypt code is executed separately (i.e. in a different window, a separate spark-shell instance); rarely, it also occurs when both encrypt and decrypt run in the same instance. The salt and IvSpec are copied to the separate instance, as shown at the end (note: I have verified that the bytes are the same in both instances).
import java.io.{BufferedWriter, File, FileWriter, FileInputStream, FileOutputStream, BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream}
import org.apache.commons.io.FileUtils;
import javax.crypto.{Cipher, SecretKey, SecretKeyFactory, CipherInputStream, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec, PBEKeySpec}
import java.security.SecureRandom
import scala.util.Random
import scala.math.pow
val password = "Let us test this"
val random = new SecureRandom();
val salt = Array.fill[Byte](16)(0)
random.nextBytes(salt)
val IvSpec1 = Array.fill[Byte](16)(0)
random.nextBytes(IvSpec1)
val IvSpec = new IvParameterSpec(IvSpec1)
def password_to_key(password: String, salt: Array[Byte]): SecretKeySpec = {
  val spec = new PBEKeySpec(password.toCharArray(), salt, 65536, 256);
  val f = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1");
  val key = f.generateSecret(spec).getEncoded()
  new SecretKeySpec(key, "AES")
}
val Key = password_to_key(password, salt)
val Algorithm = "AES/CBC/PKCS5Padding"
val cipher_encrypt = Cipher.getInstance(Algorithm)
cipher_encrypt.init(Cipher.ENCRYPT_MODE, Key, IvSpec)
val cipher_decrypt = Cipher.getInstance(Algorithm)
cipher_decrypt.init(Cipher.DECRYPT_MODE, Key, IvSpec)
def encrypt(file_in: String, file_out: String, cipher_encrypt: javax.crypto.Cipher) {
  val in = new BufferedInputStream(new FileInputStream(file_in))
  val out = new BufferedOutputStream(new FileOutputStream(file_out))
  val out_encrypted = new CipherOutputStream(out, cipher_encrypt)
  val bufferSize = 1024 * pow(2,4).toInt
  val bb = new Array[Byte](bufferSize)
  var bb_read = in.read(bb, 0, bufferSize)
  while (bb_read > 0) {
    out_encrypted.write(bb, 0, bb_read)
    bb_read = in.read(bb, 0, bufferSize)
  }
  in.close()
  out_encrypted.close()
  out.close()
}
def decrypt(file_in: String, file_out: String, cipher_decrypt: javax.crypto.Cipher) {
  val in = new BufferedInputStream(new FileInputStream(file_in))
  val in_decrypted = new CipherInputStream(in, cipher_decrypt)
  val out = new BufferedOutputStream(new FileOutputStream(file_out))
  val bufferSize = 1024 * pow(2,4).toInt
  val bb = new Array[Byte](bufferSize)
  var bb_read = in_decrypted.read(bb, 0, bufferSize)
  while (bb_read > 0) {
    out.write(bb, 0, bb_read)
    bb_read = in_decrypted.read(bb, 0, bufferSize)
  }
  in_decrypted.close()
  in.close()
  out.close()
}
val file_in = "test.csv"
val file_encrypt = "test_encrypt.csv"
val file_decrypt = "test_decrypt.csv"
encrypt(file_in, file_encrypt, cipher_encrypt)
decrypt( file_encrypt, file_decrypt, cipher_decrypt)
// To write salt, IvSpec (to re-read it in a separate instance...)
val salt_loc = new File("salt.txt")
val IvSpec_loc = new File("IvSpec.txt")
val salt_w = new FileOutputStream(salt_loc)
salt_w.write(salt)
salt_w.close()
val IvSpec_w = new FileOutputStream(IvSpec_loc)
IvSpec_w.write(IvSpec1)
IvSpec_w.close()
//to re-read salt and IvSpec in a separate instance...
//Ignore that we do not need to re-read IvSpec
val salt_loc = new File("salt.txt")
val IvSpec_loc = new File("IvSpec.txt")
val salt_r = new FileInputStream(salt_loc)
val salt = Stream.continually(salt_r.read).takeWhile(-1 !=).map(_.toByte).toArray
val IvSpec_r = new FileInputStream(IvSpec_loc)
val IvSpec1 = Stream.continually(IvSpec_r.read).takeWhile(-1 !=).map(_.toByte).toArray
val IvSpec = new IvParameterSpec(IvSpec1)
When the decrypt code is executed in a separate Java process, this consistently gives a padding-related error. It works most of the time (95%+) when encrypt and decrypt are done in the same script (e.g. the above will usually work if executed in the same instance). If decrypt is done separately, reading the salt and IvSpec back in a separate window/process/thread/Java instance, it fails.
The error is java.io.IOException: javax.crypto.BadPaddingException: Given final block not properly padded
at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:121)
at javax.crypto.CipherInputStream.read(CipherInputStream.java:239)
at javax.crypto.CipherInputStream.read(CipherInputStream.java:215)
at decrypt3(<console>:81)
... 60 elided
Caused by: javax.crypto.BadPaddingException: Given final block not properly padded
at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:991)
at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:847)
at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:446)
at javax.crypto.Cipher.doFinal(Cipher.java:2047)
at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:118)
... 63 more

Bad Request on AWS ElasticSearch

I'm trying to connect to an IAM-controlled ElasticSearch domain. I've created a request and signed it, and everything works fine for the GET method, but with the POST method I get a 400 Bad Request.
This clearly has something to do with the payload. If I provide an empty string ("") as the payload it works appropriately, but anything else results in a bad request.
What am I missing?
val url = s"https://$host/TEST/article/_search"
val serviceName = "es"
val regionName = "us-east-1"
val request = new DefaultRequest(serviceName)
val payload =
"""{"1":"1"}""".trim
val payloadBytes = payload.getBytes(StandardCharsets.UTF_8)
val payloadStream = new ByteArrayInputStream(payloadBytes)
request.setContent(payloadStream)
val endpointUri = URI.create(url)
request.setEndpoint(endpointUri)
request.setHttpMethod(HttpMethodName.POST)
val credProvider = new EnvironmentVariableCredentialsProvider
val credentials = credProvider.getCredentials
val signer = new AWS4Signer
signer.setRegionName(regionName)
signer.setServiceName(serviceName)
signer.sign(request, credentials)
val context = new ExecutionContext(true)
val clientConfiguration = new ClientConfiguration()
val client = new AmazonHttpClient(clientConfiguration)
val rh = new MyHttpResponseHandler
val eh = new MyErrorHandler
val response =
client.execute(request, rh , eh, context);
Note: if you run into this problem, inspect the actual content of the response; it may be the result of a mismatch between the index and your query.
My problem was that the specific query I was using was inappropriate for the specified index, and that resulted in a 400.
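For reference, the original payload {"1":"1"} is not a valid _search body, which by itself is enough to trigger a 400 parsing error. A minimal well-formed body (a sketch; whether it returns useful results depends on the index mapping) would be:

// match_all is the simplest valid _search query; it returns every document in the index
val payload =
  """{"query": {"match_all": {}}}""".trim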