How to check whether an Azure Blob Storage path exists using Scala Spark or Python Spark

Please let me know how to check whether the below blob file exists or not.
file path name: "wasbs://containername@storageaccountname.blob.core.windows.net/directoryname/meta_loaddate=20190512/"

Below is code that should work for you; feel free to edit/customize per your needs:
from azure.storage.blob import BlockBlobService
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()  # set up the Spark session
session.conf.set("fs.azure.account.key.<storage-account-name>.blob.core.windows.net", "<storage-account-key>")
sdf = session.read.parquet("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<prefix>")

block_blob_service = BlockBlobService(account_name='', account_key='')

def blob_exists():
    container_name = ""
    blob_name = ""
    exists = block_blob_service.exists(container_name, blob_name)
    return exists

blobstat = blob_exists()
print(blobstat)  # True if the blob exists, else False

// Scala, using the Azure Storage SDK:
import com.microsoft.azure.storage.CloudStorageAccount

private val storageConnectionString = s"DefaultEndpointsProtocol=http;AccountName=$account;AccountKey=$accessKey"
private val cloudStorageAccount = CloudStorageAccount.parse(storageConnectionString)
private val serviceClient = cloudStorageAccount.createCloudBlobClient
private val container = serviceClient.getContainerReference("data")
val ref = container.getBlockBlobReference(path)
val existOrNot = ref.exists()
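Alternatively, if you just want to test the path from Spark itself, a minimal sketch (assuming spark is your SparkSession and the storage account key has already been set in its Hadoop configuration, as in the PySpark snippet above) is to ask Hadoop's FileSystem API:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// illustrative path; substitute your container/account/directory
val dirPath = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/directoryname/meta_loaddate=20190512/"
val fs = FileSystem.get(new URI(dirPath), spark.sparkContext.hadoopConfiguration)
val pathExists = fs.exists(new Path(dirPath)) // true if the directory/blob prefix exists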

Related

How to resolve "Wrong FS: hdfs:/..., expected: file:///" when getting the list of subdirectories with a Path (String)?

I'm trying to get the name of the latest subdirectory created in my directory "DWP".
I managed to execute this code with a local path, but running it on my HDFS cluster I get the error "Wrong FS: hdfs:/..., expected: file:///".
def lastDirectoryHour(): String = {
  val env = System.getenv("IP_HDFS")
  // "yyyy" is the calendar year; "YYYY" would be the week-based year
  val currentDate = DateTimeFormatter.ofPattern("yyyy-MM-dd").format(java.time.LocalDate.now)
  val readingPath = "hdfs://".concat(env).concat(":9000/user/bronze/json/DWP/").concat(currentDate).concat("/")
  val fs = FileSystem.get(new Configuration())
  val status = fs.listStatus(new Path(readingPath))
  var listDir = ListBuffer[Long]()
  var DirName: String = ""
  for (value <- status) {
    listDir += value.getModificationTime
  }
  for (value <- status) {
    if (value.getModificationTime == listDir.max) {
      DirName = value.getPath.getName
    }
  }
  readingPath.concat(DirName)
}
When I add "addRessource" as some answers say, I'm unable to use "listStatus" which return the name.
Do you know how can I change my code in order to keep it returning me the latest subdirectory name ?
Thank you very much in advance.
You need to set the value of fs.defaultFS in your configuration object to the address of the namenode, most probably everything up to port 9000 in your reading path.
A second option is to get the FileSystem from the path itself:
val fs = new Path(readingPath).getFileSystem(new Configuration())
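For the first option, a minimal sketch (reusing env and readingPath from the question's code, and assuming the namenode address is the hdfs://<ip>:9000 prefix built there) would be:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://".concat(env).concat(":9000")) // namenode address, as assumed above
val fs = FileSystem.get(conf)
val status = fs.listStatus(new Path(readingPath))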

How to move files from one S3 bucket directory to another directory in the same bucket? Scala/Java

I want to move all files under a directory in my S3 bucket to another directory within the same bucket, using Scala.
Here is what I have:
def copyFromInputFilesToArchive(spark: SparkSession): Unit = {
  val sourcePath = new Path("s3a://path-to-source-directory/")
  val destPath = new Path("s3a://path-to-destination-directory/")
  val fs = sourcePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
  fs.moveFromLocalFile(sourcePath, destPath)
}
I get this error:
fs.copyFromLocalFile returns Wrong FS: s3a:// expected file:///
Error explained
The error you are seeing is because the copyFromLocalFile method is really for copying files from a local filesystem to S3, whereas you are trying to "move" files that are already both in S3.
It is important to note that directories don't really exist in Amazon S3 buckets - the folder/file hierarchy you see is really just key-value metadata attached to the objects. All file objects sit in the same big, single-level container, and the filename key is there to give the illusion of files/folders.
To "move" files in a bucket, what you really need to do is update the filename key with the new path, which is really just editing object metadata.
How to do a "move" within a bucket with Scala
To accomplish this, you'd need to copy the original object, assign the new metadata to the copy, and then write it back to S3. In practice, you can copy it and save it to the same object which will overwrite the old version, which acts a lot like an update.
Try something like this (from datahackr):
/**
* Copy object from a key to another in multiparts
*
* @param sourceS3Path     source S3 object key
* @param targetS3Path     target S3 object key
* @param fromS3BucketName source bucket name
* @param toS3BucketName   destination bucket name
*/
@throws(classOf[Exception])
@throws(classOf[AmazonServiceException])
def copyMultipart(sourceS3Path: String, targetS3Path: String, fromS3BucketName: String, toS3BucketName: String) {
// Create a list of ETag objects. You retrieve ETags for each object part uploaded,
// then, after each individual part has been uploaded, pass the list of ETags to
// the request to complete the upload.
var partETags = new ArrayList[PartETag]();
// Initiate the multipart upload.
val initRequest = new InitiateMultipartUploadRequest(toS3BucketName, targetS3Path);
val initResponse = s3client.initiateMultipartUpload(initRequest);
// Get the object size to track the end of the copy operation.
var metadataResult = getS3ObjectMetadata(sourceS3Path, fromS3BucketName);
var objectSize = metadataResult.getContentLength();
// Copy the object using 50 MB parts.
val partSize = (50 * 1024 * 1024) * 1L;
var bytePosition = 0L;
var partNum = 1;
var copyResponses = new ArrayList[CopyPartResult]();
while (bytePosition < objectSize) {
// The last part might be smaller than partSize, so check to make sure
// that lastByte isn't beyond the end of the object.
val lastByte = Math.min(bytePosition + partSize - 1, objectSize - 1);
// Copy this part.
val copyRequest = new CopyPartRequest()
.withSourceBucketName(fromS3BucketName)
.withSourceKey(sourceS3Path)
.withDestinationBucketName(toS3BucketName)
.withDestinationKey(targetS3Path)
.withUploadId(initResponse.getUploadId())
.withFirstByte(bytePosition)
.withLastByte(lastByte)
.withPartNumber(partNum + 1);
partNum += 1;
copyResponses.add(s3client.copyPart(copyRequest));
bytePosition += partSize;
}
// Complete the upload request to concatenate all uploaded parts and make the copied object available.
val completeRequest = new CompleteMultipartUploadRequest(
toS3BucketName,
targetS3Path,
initResponse.getUploadId(),
getETags(copyResponses));
s3client.completeMultipartUpload(completeRequest);
logger.info("Multipart upload complete.");
}
// This is a helper function to construct a list of ETags.
def getETags(responses: java.util.List[CopyPartResult]): ArrayList[PartETag] = {
var etags = new ArrayList[PartETag]();
val it = responses.iterator();
while (it.hasNext()) {
val response = it.next();
etags.add(new PartETag(response.getPartNumber(), response.getETag()));
}
return etags;
}
def moveObject(sourceS3Path: String, targetS3Path: String, fromBucketName: String, toBucketName: String) {
logger.info(s"Moving S3 frile from $sourceS3Path ==> $targetS3Path")
// Get the object size to track the end of the copy operation.
var metadataResult = getS3ObjectMetadata(sourceS3Path, fromBucketName);
var objectSize = metadataResult.getContentLength();
if (objectSize > ALLOWED_OBJECT_SIZE) {
logger.info("Object size is greater than 1GB. Initiating multipart upload.");
copyMultipart(sourceS3Path, targetS3Path, fromBucketName, toBucketName);
} else {
s3client.copyObject(fromBucketName, sourceS3Path, toBucketName, targetS3Path);
}
// Delete source object after successful copy
s3client.deleteObject(fromBucketName, sourceS3Path);
}
You will need the AWS SDK for this.
If you are using AWS SDK version 1:
libraryDependencies ++= Seq(
"com.amazonaws" % "aws-java-sdk-s3" % "1.12.248"
)
import com.amazonaws.services.s3.transfer.{ Copy, TransferManager, TransferManagerBuilder }
val transferManager: TransferManager =
TransferManagerBuilder.standard().build()
def copyFile(): Unit = {
val copy: Copy =
transferManager.copy(
"source-bucket-name", "source-file-key",
"destination-bucket-name", "destination-file-key"
)
copy.waitForCompletion()
}
If you are using AWS SDK version 2:
libraryDependencies ++= Seq(
"software.amazon.awssdk" % "s3" % "2.17.219",
"software.amazon.awssdk" % "s3-transfer-manager" % "2.17.219-PREVIEW"
)
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.model.CopyObjectRequest
import software.amazon.awssdk.transfer.s3.{Copy, CopyRequest, S3ClientConfiguration, S3TransferManager}
// change Region.US_WEST_2 to your required region
// or it might even work without the whole `.region(Region.US_WEST_2)` thing
val s3ClientConfig: S3ClientConfiguration =
S3ClientConfiguration
.builder()
.region(Region.US_WEST_2)
.build()
val s3TransferManager: S3TransferManager =
S3TransferManager.builder().s3ClientConfiguration(s3ClientConfig).build()
def copyFile(): Unit = {
val copyObjectRequest: CopyObjectRequest =
CopyObjectRequest
.builder()
.sourceBucket("source-bucket-name")
.sourceKey("source-file-key")
.destinationBucket("destination-bucket-name")
.destinationKey("destination-file-key")
.build()
val copyRequest: CopyRequest =
CopyRequest
.builder()
.copyObjectRequest(copyObjectRequest)
.build()
val copy: Copy =
s3TransferManager.copy(copyRequest)
copy.completionFuture().get()
}
Keep in mind that you will need AWS credentials with appropriate permissions for both the source and destination objects. For this, you just need to get the credentials and make them available as the following environment variables.
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
export AWS_SESSION_TOKEN=your_session_token
Also, "source-file-key" and "destination-file-key" should be the full path of the file in the bucket.

decompress (unzip/extract) util using spark scala

I have customer_input_data.tar.gz in HDFS, which contains data for 10 different tables in CSV file format. I need to unzip this file to /my/output/path using Spark Scala.
Please suggest how to unzip the customer_input_data.tar.gz file using Spark Scala.
gzip is not a splittable format in Hadoop, so the file is not really going to be distributed across the cluster and you don't get any benefit of distributed compute/processing in Hadoop or Spark.
A better approach may be to uncompress the file on the OS and then send the individual files back to Hadoop.
If you still want to uncompress in Scala, you can simply resort to the Java class GZIPInputStream via
new GZIPInputStream(new FileInputStream("your file path"))
I developed the below code to decompress the files using Scala. You need to pass the input path, the output path, and the Hadoop FileSystem (out below is assumed to be a PrintStream such as System.out, and BUFFER_SIZE a constant, e.g. 1024).
/*below method used for processing zip files*/
@throws[IOException]
private def processTargz(fullpath: String, houtPath: String, fs: FileSystem): Unit = {
val path = new Path(fullpath)
val gzipIn = new GzipCompressorInputStream(fs.open(path))
try {
val tarIn = new TarArchiveInputStream(gzipIn)
try {
var entry:TarArchiveEntry = null
out.println("Tar entry")
out.println("Tar Name entry :" + FilenameUtils.getName(fullpath))
val fileName1 = FilenameUtils.getName(fullpath)
val tarNamesFolder = fileName1.substring(0, fileName1.indexOf('.'))
out.println("Folder Name : " + tarNamesFolder)
entry = tarIn.getNextTarEntry
while (entry != null) { // the entry name is the csv/tsv file name inside the compressed tar file
out.println("ENTITY NAME : " + entry.getName)
/** If the entry is a directory, create the directory. **/
out.println("While")
if (entry.isDirectory) {
val f = new File(entry.getName)
val created = f.mkdir
out.println("mkdir")
if (!created) {
out.printf("Unable to create directory '%s', during extraction of archive contents.%n", f.getAbsolutePath)
out.println("Absolute path")
}
}
else {
var count = 0
val slash = "/"
val targetPath = houtPath + slash + tarNamesFolder + slash + entry.getName
val hdfswritepath = new Path(targetPath)
val fos = fs.create(hdfswritepath, true)
try {
val dest = new BufferedOutputStream(fos, BUFFER_SIZE)
try {
val data = new Array[Byte](BUFFER_SIZE)
count = tarIn.read(data, 0, BUFFER_SIZE)
while (count != -1) {
dest.write(data, 0, count)
count = tarIn.read(data, 0, BUFFER_SIZE)
}
} finally if (dest != null) dest.close()
}
}
entry = tarIn.getNextTarEntry
}
out.println("Untar completed successfully!")
} catch {
case e: IOException =>
out.println("catch Block")
} finally {
out.println("FINAL Block")
if (tarIn != null) tarIn.close()
}
}
}
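A possible invocation of the method above (a sketch; the input path is illustrative and the output path is the one from the question):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val fs = FileSystem.get(new Configuration())
processTargz("/my/input/path/customer_input_data.tar.gz", "/my/output/path", fs)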

How to read a .tar file containing parquets on S3 as dataframes in Spark?

I need to load a .tar file on S3 that contains multiple parquet files with different schemas using Scala/Spark. Ideally I'd like to read one of these parquet files into a Spark dataframe. I tried to get the S3 object and then convert it to a tar input stream using org.apache.commons.compress.archivers.tar.TarArchiveInputStream; it was able to create the tar input stream but failed to read the tar entries.
val s3client: AmazonS3 = AmazonS3ClientBuilder.
standard().
withCredentials(new InstanceProfileCredentialsProvider()).
withRegion(my_region).
build();
val tarFile = s3client.getObject(my_bucket, my_tar_file)
val tarInputStream = new TarArchiveInputStream(tarFile.getObjectContent)
tarInputStream.getNextTarEntry() <-- error thrown in this line
Error:
java.io.IOException: Error detected parsing the header
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:240)
... 52 elided
Caused by: java.lang.IllegalArgumentException: Invalid byte 48 at offset 7 in '00755{NUL}00' len=8
at org.apache.commons.compress.archivers.tar.TarUtils.parseOctal(TarUtils.java:127)
at org.apache.commons.compress.archivers.tar.TarUtils.parseOctalOrBinary(TarUtils.java:171)
at org.apache.commons.compress.archivers.tar.TarArchiveEntry.parseTarHeader(TarArchiveEntry.java:935)
at org.apache.commons.compress.archivers.tar.TarArchiveEntry.parseTarHeader(TarArchiveEntry.java:924)
at org.apache.commons.compress.archivers.tar.TarArchiveEntry.<init>(TarArchiveEntry.java:328)
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:238)
Does anyone know the proper way to extract part of a tar file on S3 in Spark?
Follow this example; I hope you are using tar.gz:
AWSCredentials credentials = new BasicAWSCredentials("accessKey", "secretKey");
AWSCredentialsProvider credentialsProvider = new AWSStaticCredentialsProvider(credentials);
AmazonS3 s3Client = AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).withCredentials(credentialsProvider).build();
S3Object object = s3Client.getObject("bucketname", "file.tar.gz");
S3ObjectInputStream objectContent = object.getObjectContent();
TarArchiveInputStream tarInputStream = new TarArchiveInputStream(new GZIPInputStream(objectContent));
TarArchiveEntry currentEntry;
while((currentEntry = tarInputStream.getNextTarEntry()) != null) {
if(currentEntry.getName().equals("1/foo.bar") && currentEntry.isFile()) {
FileOutputStream entryOs = new FileOutputStream("foo.bar");
IOUtils.copy(tarInputStream, entryOs);
entryOs.close();
break;
}
}
objectContent.abort(); // Warning at this line
tarInputStream.close(); // warning at this line
The Scala equivalent is:
val credentials: AWSCredentials =
new BasicAWSCredentials("accessKey", "secretKey")
val credentialsProvider: AWSCredentialsProvider =
new AWSStaticCredentialsProvider(credentials)
val s3Client: AmazonS3 = AmazonS3ClientBuilder
.standard()
.withRegion(Regions.US_EAST_1)
.withCredentials(credentialsProvider)
.build()
val s3object: S3Object = s3Client.getObject("bucketname", "file.tar.gz")
val objectContent: S3ObjectInputStream = s3object.getObjectContent
val tarInputStream: TarArchiveInputStream = new TarArchiveInputStream(
new GZIPInputStream(objectContent))
var currentEntry: TarArchiveEntry = tarInputStream.getNextTarEntry
while (currentEntry != null) {
  if (currentEntry.getName == "1/foo.bar" && currentEntry.isFile) {
    val entryOs: FileOutputStream = new FileOutputStream("foo.bar")
    IOUtils.copy(tarInputStream, entryOs)
    entryOs.close()
  }
  currentEntry = tarInputStream.getNextTarEntry
}
objectContent.abort()
tarInputStream.close()
Update:
Since you are using only tar, not gzip, you have to read it like this (the S3 object content is already an InputStream, so no FileInputStream wrapper is needed):
val tarInputStream = new TarArchiveInputStream(tarFile.getObjectContent)
In your case you are passing the object content as a plain InputStream. My suggestion is to wrap it in a GZIPInputStream, then read the entries:
// instead of: val tarInputStream = new TarArchiveInputStream(tarFile.getObjectContent)
val tarInputStream = new TarArchiveInputStream(new GZIPInputStream(tarFile.getObjectContent))
val entry: TarArchiveEntry = readEntries(tarInputStream)
def readEntries(tarInputStream: TarArchiveInputStream): TarArchiveEntry = {
var currentEntry = Option(tarInputStream.getNextTarEntry())
// you can use functional approach with foldLeft, reduce or something else or while loop
// implementation details here
}
You can find an example of how to use TarArchiveInputStream here.
You can use GetObjectRequest to create an S3Object
val s3FullObject: S3Object = s3client.getObject(new GetObjectRequest(s3Bucket, s3TarPath))
val tis = new TarArchiveInputStream(s3FullObject.getObjectContent)
var entry: TarArchiveEntry = tis.getNextTarEntry
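From there, one way (a sketch, not the only option) to get a single parquet into a dataframe is to copy the matching entry to a local temporary file on the driver and read that; the entry name below is illustrative and spark is assumed to be your SparkSession:
import java.io.FileOutputStream
import org.apache.commons.compress.utils.IOUtils

val tmp = java.io.File.createTempFile("extracted", ".parquet")
while (entry != null) {
  if (entry.isFile && entry.getName.endsWith("my_table.parquet")) { // illustrative entry name
    val os = new FileOutputStream(tmp)
    IOUtils.copy(tis, os) // copies only the bytes of the current entry
    os.close()
  }
  entry = tis.getNextTarEntry
}
val df = spark.read.parquet("file://" + tmp.getAbsolutePath)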

In Spark, while writing a dataset into a database, a fixed amount of time is spent before the save operation starts

I ran the spark-submit command as mentioned below, which loads the datasets from the DB, processes them, and in the final stage pushes multiple datasets into the Oracle DB.
./spark-submit --class com.sample.Transformation --conf spark.sql.shuffle.partitions=5001
--num-executors=40 --executor-cores=1 --executor-memory=5G
--jars /scratch/rmbbuild/spark_ormb/drools-jars/ojdbc6.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/kie-api-7.7.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/drools-core-7.7.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/drools-compiler-7.7.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-maven-support-7.7.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/kie-internal-7.7.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/xstream-1.4.10.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-commons-7.7.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/ecj-4.4.2.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/mvel2-2.4.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-project-datamodel-commons-7.7.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-project-datamodel-api-7.7.0.Final.jar
--driver-class-path /scratch/rmbbuild/spark_ormb/drools-jars/ojdbc6.jar
--master spark://10.180.181.41:7077 "/scratch/rmbbuild/spark_ormb/POC-jar/Transformation-0.0.1-SNAPSHOT.jar"
> /scratch/rmbbuild/spark_ormb/POC-jar/logs/logs12.txt
But it takes a seemingly fixed amount of time while writing the dataset into the DB, and I don't know why it spends this long before starting the write process.
I am attaching a screenshot which clearly highlights the problem I am facing.
Please go through the screenshot before commenting with a solution.
Spark Dashboard Stages Screenshot:
In the screenshot I have highlighted a duration of around 10 minutes, which is consumed before every dataset write into the DB.
I even changed the batchSize to 100000, as follows:
outputDataSetforsummary.write().mode("append").format("jdbc").option("url", connection)
.option("batchSize", "100000").option("dbtable", CI_TXN_DTL).save();
So, can anyone explain why this pre-write time is consumed every time, and how to avoid it?
I am attaching the code for a fuller description of the program.
public static void main(String[] args) {
// SparkConf conf = new SparkConf().setAppName("Transformation").setMaster("local");
SparkConf conf = new SparkConf().setAppName("Transformation").setMaster("spark://xx.xx.xx.xx:7077");
String connection = "jdbc:oracle:thin:ABC/abc@//xx.x.x.x:1521/ABC";
// Create Spark Context
SparkContext context = new SparkContext(conf);
// Create Spark Session
SparkSession sparkSession = new SparkSession(context);
Dataset<Row> txnDf = sparkSession.read().format("jdbc").option("url", connection).option("dbtable", CI_TXN_DETAIL_STG).load();
//Dataset<Row> txnDf = sparkSession.read().format("jdbc").option("url", connection).option("dbtable", "CI_TXN_DETAIL_STG").load();
Dataset<Row> newTxnDf = txnDf.drop(ACCT_ID);
Dataset<Row> accountDf = sparkSession.read().format("jdbc").option("url", connection).option("dbtable", CI_ACCT_NBR).load();
// Dataset<Row> accountDf = sparkSession.read().format("jdbc").option("url", connection).option("dbtable", "CI_ACCT_NBR").load();
Dataset<Row> joined = newTxnDf.join(accountDf, newTxnDf.col(ACCT_NBR).equalTo(accountDf.col(ACCT_NBR))
.and(newTxnDf.col(ACCT_NBR_TYPE_CD).equalTo(accountDf.col(ACCT_NBR_TYPE_CD))), "inner");
Dataset<Row> finalJoined = joined.drop(accountDf.col(ACCT_NBR_TYPE_CD)).drop(accountDf.col(ACCT_NBR))
.drop(accountDf.col(VERSION)).drop(accountDf.col(PRIM_SW));
initializeProductDerivationCache(sparkSession,connection);
ClassTag<List<String>> evidenceForDivision = scala.reflect.ClassTag$.MODULE$.apply(List.class);
Broadcast<List<String>> broadcastVarForDiv = context.broadcast(divisionList, evidenceForDivision);
ClassTag<List<String>> evidenceForCurrency = scala.reflect.ClassTag$.MODULE$.apply(List.class);
Broadcast<List<String>> broadcastVarForCurrency = context.broadcast(currencySet, evidenceForCurrency);
ClassTag<List<String>> evidenceForUserID = scala.reflect.ClassTag$.MODULE$.apply(List.class);
Broadcast<List<String>> broadcastVarForUserID = context.broadcast(userIdList, evidenceForUserID);
Encoder<RuleParamsBean> encoder = Encoders.bean(RuleParamsBean.class);
Dataset<RuleParamsBean> ds = new Dataset<RuleParamsBean>(sparkSession, finalJoined.logicalPlan(), encoder);
Dataset<RuleParamsBean> validateDataset = ds.map(ruleParamsBean -> validateTransaction(ruleParamsBean,broadcastVarForDiv.value(),broadcastVarForCurrency.value(),
broadcastVarForUserID.value()),encoder);
Dataset<RuleParamsBean> filteredDS = validateDataset.filter(validateDataset.col(BO_STATUS_CD).notEqual(TFMAppConstants.TXN_INVALID));
//For formatting the data to be inserted in table --> Dataset<Row>finalvalidateDataset = validateDataset.select("ACCT_ID");
Encoder<TxnDetailOutput>txndetailencoder = Encoders.bean(TxnDetailOutput.class);
Dataset<TxnDetailOutput>txndetailDS =validateDataset.map(ruleParamsBean ->outputfortxndetail(ruleParamsBean),txndetailencoder );
KieServices ks = KieServices.Factory.get();
KieContainer kContainer = ks.getKieClasspathContainer();
ClassTag<KieBase> classTagTest = scala.reflect.ClassTag$.MODULE$.apply(KieBase.class);
Broadcast<KieBase> broadcastRules = context.broadcast(kContainer.getKieBase(KIE_BASE), classTagTest);
Encoder<PritmRuleOutput> outputEncoder = Encoders.bean(PritmRuleOutput.class);
Dataset<PritmRuleOutput> outputDataSet = filteredDS.flatMap(rulesParamBean -> droolprocesMap(broadcastRules.value(), rulesParamBean), outputEncoder);
Dataset<Row>piParamDS1 =outputDataSet.select(PRICEITEM_PARM_GRP_VAL);
Dataset<Row> piParamDS = piParamDS1.withColumnRenamed(PRICEITEM_PARM_GRP_VAL, PARM_STR);
priceItemParamGrpValueCache.createOrReplaceTempView("temp1");
Dataset<Row>piParamDSS = piParamDS.where(queryToFiltertheDuplicateParamVal);
Dataset<Row> priceItemParamsGrpDS = piParamDSS.select(PARM_STR).distinct().withColumn(PRICEITEM_PARM_GRP_ID, functions.monotonically_increasing_id());
Dataset<Row> finalpriceItemParamsGrpDS = priceItemParamsGrpDS.withColumn(PARM_COUNT, functions.size(functions.split(priceItemParamsGrpDS.col(PARM_STR),TOKENIZER)));
finalpriceItemParamsGrpDS.persist(StorageLevel.MEMORY_ONLY());
finalpriceItemParamsGrpDS.distinct().write().mode("append").format("jdbc").option("url", connection).option("dbtable", CI_PRICEITEM_PARM_GRP_K).option("batchSize", "1000").save();
Dataset<Row> PritmOutput = outputDataSet.join(priceItemParamsGrpDS,outputDataSet.col(PRICEITEM_PARM_GRP_VAL).equalTo(priceItemParamsGrpDS.col(PARM_STR)),"inner");
Dataset<Row> samplePritmOutput = PritmOutput.drop(outputDataSet.col(PRICEITEM_PARM_GRP_ID))
.drop(priceItemParamsGrpDS.col(PARM_STR));
priceItemParamsGrpDS.createOrReplaceTempView(PARM_STR);
Dataset<Row> priceItemParamsGroupTable =sparkSession.sql(FETCH_QUERY_TO_SPLIT);
Dataset<Row> finalpriceItemParamsGroupTable = priceItemParamsGroupTable.selectExpr("PRICEITEM_PARM_GRP_ID","split(col, '=')[0] as PRICEITEM_PARM_CD ","split(col, '=')[1] as PRICEITEM_PARM_VAL");
finalpriceItemParamsGroupTable.persist(StorageLevel.MEMORY_ONLY());
finalpriceItemParamsGroupTable.distinct().write().mode("append").format("jdbc").option("url", connection).option("dbtable", CI_PRICEITEM_PARM_GRP).option("batchSize", "1000").save();
}
Spark reloads the whole data and joins the data frames again and again on every write-to-DB action.
Please add validateDataset.persist(StorageLevel.MEMORY_ONLY()) (choose memory only, disk only, or memory-and-disk yourself, depending on whether your data frame fits in memory or not).
For example:
Dataset<RuleParamsBean> validateDataset = ds.map(ruleParamsBean -> validateTransaction(ruleParamsBean,broadcastVarForDiv.value(),broadcastVarForCurrency.value(),broadcastVarForUserID.value()),encoder)
.persist(StorageLevel.MEMORY_ONLY());