How to overwrite data in AWS Glue? - scala

Consider the code:
val inputTable = glueContext
  .getCatalogSource(database = "my_db", tableName = "my_table")
  .getDynamicFrame()
glueContext.getSinkWithFormat(
  connectionType = "s3",
  options = JsonOptions(Map("path" -> "s3://my_out_path")),
  format = "orc", transformationContext = ""
).writeDynamicFrame(inputTable)
When I run this code twice, new ORC files are added to the old ones in "s3://my_out_path". Is there a way to always overwrite the output path?
Note
The written data is not partitioned.

Yes, you can use Spark to overwrite the content. You can still read your data with Glue methods, then convert it to a Spark DataFrame and overwrite the files:
val datasink = inputTable.toDF()
datasink.write
  .format("orc")
  .mode("overwrite")
  .save("s3://my_out_path")

Related

Synapse - Notebook not working from Pipeline

I have a notebook in Azure Synapse that reads parquet files into a data frame using the synapsesql function and then pushes the data frame contents into a table in the SQL Pool.
Executing the notebook manually is successful and the table is created and populated in the Synapse SQL pool.
When I call the same notebook from an Azure Synapse pipeline, it reports success but does not create the table. I am using the Synapse Notebook activity in the pipeline.
What could be the issue here?
I am getting deprecation warnings around the synapsesql function but don't know what is actually deprecated.
The code is below.
%%spark
val pEnvironment = "t"
val pFolderName = "TestFolder"
val pSourceDatabaseName = "TestDatabase"
val pSourceSchemaName = "TestSchema"
val pRootFolderName = "RootFolder"
val pServerName = pEnvironment + "synas01"
val pDatabaseName = pEnvironment + "syndsqlp01"
val pTableName = pSourceDatabaseName + "_" + pSourceSchemaName + "_" + pFolderName
// Import functions and Synapse connector
import org.apache.spark.sql.DataFrame
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SqlAnalyticsConnector._
// Get list of "FileLocation" from control.FileLoadStatus
val fls:DataFrame = spark.read.
synapsesql(s"${pDatabaseName}.control.FileLoadStatus").
select("FileLocation","ProcessedDate")
// Read all parquet files in folder into data frame
// Add file name as column
val df:DataFrame = spark.read.
parquet(s"/source/${pRootFolderName}/${pFolderName}/").
withColumn("FileLocation", input_file_name())
// Join parquet file data frame to FileLoadStatus data frame
// Exclude rows in parquet file data frame where ProcessedDate is not null
val df2 = df.
join(fls,Seq("FileLocation"), "left").
where(fls("ProcessedDate").isNull)
// Write data frame to sql table
df2.write.
option(Constants.SERVER,s"${pServerName}.sql.azuresynapse.net").
synapsesql(s"${pDatabaseName}.xtr.${pTableName}",Constants.INTERNAL)
This happens often. To get the output after the pipeline execution, follow these steps:
Pick up the Apache Spark application name from the output of the pipeline.
Navigate to Apache Spark applications under the Monitor tab and search for that application name.
These four tabs will be available there: Diagnostics, Logs, Input data, Output data.
Go to Logs and check 'stdout' for the required output.
https://www.youtube.com/watch?v=ydEXCVVGAiY
See the video linked above for a detailed walkthrough.

Merging too many small files into single large files in Datalake using Apache Spark

I have the following directory structure in HDFS:
/user/hdfs/landing_zone/year=2021/month=11/day=01/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=01/part-2.txt
/user/hdfs/landing_zone/year=2021/month=11/day=01/part-3.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-2.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-3.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-2.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-3.txt
I want to merge the files day-wise, so that the result looks like:
/user/hdfs/landing_zone/year=2021/month=11/day=01/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-1.txt
I have used the code below:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val inputDir = "/user/hdfs/landing_zone/year=2021/month=11/"
val hadoopConf = spark.sparkContext.hadoopConfiguration
val fs: FileSystem = FileSystem.get(hadoopConf)
val sc = spark.sparkContext
val baseFolder = new Path(inputDir)
val files = baseFolder.getFileSystem(sc.hadoopConfiguration).listStatus(baseFolder).map(_.getPath.toString)
for (path <- files) {
  var Folder_Path = fs.listStatus(new Path(path)).map(_.getPath).toList
  for (eachfolder <- Folder_Path) {
    var New_Folder_Path: String = eachfolder.toString
    var Fs1 = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    var FilePath = Fs1.listStatus(new Path(s"${New_Folder_Path}")).filter(_.isFile).map(_.getPath).toList
    var NewFiles = Fs1.listStatus(new Path(s"${New_Folder_Path}")).filter(_.isFile).map(_.getPath.getName).toList
    // TODO: merge the files for this day into a single file
  }
}
"FilePath": generates the list of complete paths for all the files recursively:
List(/user/hdfs/landing_zone/year=2021/month=11/day=01/part-1.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=01/part-2.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=01/part-3.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=02/part-1.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=02/part-2.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=02/part-3.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=03/part-1.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=03/part-2.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=03/part-3.txt)
"NewFiles": generates the list of file names for all the files recursively:
List(part-1.txt)
List(part-2.txt)
List(part-3.txt)
List(part-1.txt)
List(part-2.txt)
List(part-3.txt)
List(part-1.txt)
List(part-2.txt)
List(part-3.txt)
Can someone suggest how I should modify the code so that it processes the files day-wise and merges the 3 files per day (1 day = 3 files) into a single file (1 day = 1 file) for all the days?
There are easier ways than getting into low-level manipulations. I would suggest "picking the table up and putting it back down".
Literally: create a table (or DataFrame) based on the files and write it back out to a new table or location. This should concatenate the small files without you having to manipulate them by hand; a sketch of this in Spark is shown below.
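For the day-partitioned layout in the question, a minimal Scala sketch of that idea could look like the following. The target path under merged_zone and the use of coalesce(1) (one output file per day) are illustrative assumptions:

// Read each day's folder and write it back out as a single file per day.
// Assumes a SparkSession named `spark` and the landing-zone layout from the question.
val basePath = "/user/hdfs/landing_zone/year=2021/month=11"
val days = Seq("day=01", "day=02", "day=03")  // could also be discovered by listing the directory

for (day <- days) {
  spark.read
    .text(s"$basePath/$day")      // pick the table up
    .coalesce(1)                  // one output file per day
    .write
    .mode("overwrite")
    .text(s"/user/hdfs/merged_zone/year=2021/month=11/$day")  // put it back down (illustrative target path)
}

Writing to a separate target path avoids reading from and overwriting the same location within a single job.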
If you have created a Hive table on top of these files, you could also have Hive do the work:
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;

AWS Glue add new partitions and overwrite existing partitions

I'm attempting to write PySpark code in Glue that lets me update the Glue Catalog by adding new partitions and overwriting existing partitions in the same call.
I read that there is no way to overwrite partitions in Glue so we must use pyspark code similar to this:
final_df.withColumn('year', date_format('date', 'yyyy'))\
.withColumn('month', date_format('date', 'MM'))\
.withColumn('day', date_format('date', 'dd'))\
.write.mode('overwrite')\
.format('parquet')\
.partitionBy('year', 'month', 'day')\
.save('s3://my_bucket/')
However, with this method the Glue Catalog does not get updated automatically, so an MSCK REPAIR TABLE call is needed after each write. Recently AWS released a new feature, enableUpdateCatalog, where newly created partitions are immediately registered in the Glue Catalog. The code looks like this:
additionalOptions = {"enableUpdateCatalog": True}
additionalOptions["partitionKeys"] = ["year", "month", "day"]
dyn_frame_catalog = glueContext.write_dynamic_frame_from_catalog(
frame=partition_dyf,
database = "my_db",
table_name = "my_table",
format="parquet",
additional_options=additionalOptions,
transformation_ctx = "my_ctx"
)
Is there a way to combine these 2 commands or will I need to use the pyspark method with write.mode('overwrite') and run an MSCK REPAIR TABLE my_table on every run of the Glue job?
If you have not already found your answer, I believe the following will work:
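# Note: with enableUpdateCatalog=True and updateBehavior="UPDATE_IN_DATABASE", this sink should both write the data
# and register new partitions in the Glue Catalog, so a separate MSCK REPAIR TABLE run should not be needed.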
DataSink5 = glueContext.getSink(
path = "s3://...",
connection_type = "s3",
updateBehavior = "UPDATE_IN_DATABASE",
partitionKeys = ["year", "month", "day"],
enableUpdateCatalog = True,
transformation_ctx = "DataSink5")
DataSink5.setCatalogInfo(
catalogDatabase = "my_db",
catalogTableName = "my_table")
DataSink5.setFormat("glueparquet")
DataSink5.writeFrame(partition_dyf)

How to create a log of which folders have been read in Scala Spark

The HDFS folder layout is like:
/test/data/2020-03-01/{multiple csv files}
/test/data/2020-03-02/{multiple csv files}
/test/data/2020-03-03/{multiple csv files}
I want to read the data inside the folders one by one, not all at once with
spark.read.csv("/test/data/*") // not like this
Instead, I want to read the folders one by one, so that I can log in some database that a date folder has been read, and skip that folder the next time (or the same day, if the program is accidentally run again).
import java.net.URI
import org.apache.hadoop.conf.Configuration

// strOutput is the base directory, e.g. "/test/data"
val conf = new Configuration()
val iterate = org.apache.hadoop.fs.FileSystem.get(new URI(strOutput), conf).listLocatedStatus(new org.apache.hadoop.fs.Path(strOutput))
while (iterate.hasNext) {
  val pathStr = iterate.next().getPath.toString
  println("log----> " + pathStr)
  val df = spark.read.text(pathStr)
}
Try something like the above and read each folder as a DataFrame; if you want, you can union the new date's DataFrame with the old one.
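Building on that, here is a minimal sketch of the logging idea from the question, assuming a SparkSession named spark as above. The log location /test/processed_log and the use of a plain text log are illustrative assumptions; the log could just as well live in a database:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import spark.implicits._

// Assumed locations for illustration
val dataDir = "/test/data"
val logPath = "/test/processed_log"   // hypothetical log of folders that were already read

val conf = new Configuration()
val fs = FileSystem.get(new URI(dataDir), conf)

// Folders recorded as processed on previous runs (empty if the log does not exist yet)
val alreadyRead: Set[String] =
  if (fs.exists(new Path(logPath))) spark.read.text(logPath).collect().map(_.getString(0)).toSet
  else Set.empty

val iterate = fs.listLocatedStatus(new Path(dataDir))
while (iterate.hasNext) {
  val pathStr = iterate.next().getPath.toString
  if (!alreadyRead.contains(pathStr)) {
    val df = spark.read.csv(pathStr)   // read one date folder at a time
    // ... process df here ...
    // Record the folder so it is skipped on the next run
    Seq(pathStr).toDF("value").write.mode("append").text(logPath)
  }
}

On each run the already-logged folders are skipped, which also covers an accidental second run on the same day.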

HBase bulk load into multiple regions for a single table

I am trying to load data into HBase using bulk load, with the code written in Scala and Spark. But every time, the data is loaded into only one single region; I need to load it into multiple regions. I have used the code below.
HBase configuration:
def getConf: Configuration = {
val hbaseSitePath = "/etc/hbase/conf/hbase-site.xml"
val conf = HBaseConfiguration.create()
conf.addResource(new Path(hbaseSitePath))
conf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 100)
conf
}
I can load 80 GB of data into only one single region using the above configuration.
But when I try to load the same amount of data into multiple regions with the configuration below, I get the exception:
java.io.IOException: Trying to load more than 32 hfiles to one family
of one region
Updated Configuration -
def getConf: Configuration = {
val conf = HBaseConfiguration.create()
conf.addResource(new Path(hbaseSitePath))
conf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 32)
conf.setLong("hbase.hregion.max.filesize", 107374182)
conf.set("hbase.regionserver.region.split.policy","org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy")
conf
}
For saving the records I am using the code below:
val kv = new KeyValue(Bytes.toBytes(key), columnFamily.getBytes(),
columnName.getBytes(), columnValue.getBytes())
(new ImmutableBytesWritable(Bytes.toBytes(key)), kv)
rdd.saveAsNewAPIHadoopFile(pathToHFile, classOf[ImmutableBytesWritable], classOf[KeyValue],
classOf[HFileOutputFormat2], conf) //Here rdd is the input
val loadFiles = new LoadIncrementalHFiles(conf)
loadFiles.doBulkLoad(new Path(pathToHFile), hTable)
I need help with this.
You are getting this issue because 32 is the default limit of HFiles per column family per region. You should define KeyPrefixRegionSplitPolicy so that your regions split, and you can increase hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily as below:
conf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 1024)
You can also try lowering the maximum region file size so that regions split sooner:
conf.setLong("hbase.hregion.max.filesize", 107374182)