Merging files within the Azure Data Lake container - pyspark

Consider the scenario below:
I want my data to flow as follows:
import container ---> databricks (transform) --> export container
Current situation after I am done with the transformation process:
container:
---import
--folder
--mydata.csv
---export
--folder
--part-1-transformed-mydata.csv
--part-2-transformed-mydata.csv
--part-3-transformed-mydata.csv
--initial.txt
--success.txt
--finish.txt
I want the structure below:
---import
--folder
--mydata.csv
---export
--folder
--transformed-mydata.csv
What would be the preferred way to do this (considering the data is a few GB, < 10) within Databricks? I am also happy to use any functionality in Data Factory, as I am using this Databricks notebook as a step in a pipeline.
Note: I am using Apache Spark 3.0.0 and Scala 2.12 in Databricks, with 14 GB memory and 4 cores. The cluster type is Standard.

You will either need to repartition the data into a single partition (note that this defeats the point of using a distributed computing platform),
or, after the files are generated, simply run a command to concatenate them all into a single file. This might be problematic if each file has a header; you will need to account for that in your logic.
It might be better to think of the export folder as the "file", if that makes sense. That doesn't solve your problem, but unless you need to produce a single file for some reason, most consumers won't have an issue reading the data in a directory.
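If a single file really is required, here is a minimal sketch of the first option in Scala, run inside the Databricks notebook (the notebook could equally use PySpark). The export path, the transformedDf DataFrame name and the use of dbutils for the rename are illustrative assumptions, not taken from the question:
// Sketch: write to a single partition, then rename the part file (Scala, Databricks).
// exportDir, transformedDf and the <storage-account> placeholder are assumptions.
val exportDir = "abfss://export@<storage-account>.dfs.core.windows.net/folder"
val tmpDir = exportDir + "/_tmp"
transformedDf
  .coalesce(1)                                  // one partition => one part file (serialises the write)
  .write
  .mode("overwrite")
  .option("header", "true")
  .csv(tmpDir)
// Pick up the single part file, move it to the desired name, and clean up.
val partFile = dbutils.fs.ls(tmpDir).map(_.path).filter(_.contains("part-")).head
dbutils.fs.mv(partFile, exportDir + "/transformed-mydata.csv")
dbutils.fs.rm(tmpDir, recurse = true)
coalesce(1) forces all data through a single task, which is usually acceptable for a few GB but will not scale; removing the temporary directory afterwards also gets rid of the extra marker files (initial/success/finish).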

Related

Always read latest folder from s3 bucket in spark

Below is what my S3 bucket folder structure looks like:
s3://s3bucket/folder1/morefolders/$folder_which_I_want_to_pick_latest/
$folder_which_I_want_to_pick_latest - this folder always has an incrementing number for every new folder that comes in, like randomnumber_timestamp.
Is there a way I can automate this process by always reading the most recent folder in S3 from Spark in Scala?
The best way to work with that kind of "behavior" is to structure your data using a partitioned approach, like year=2020/month=02/day=12, where every partition is a folder (in the AWS console). This way you can use a simple filter in Spark to determine the latest one. (More info: https://www.datio.com/iaas/understanding-the-data-partitioning-technique/)
However, if you are not allowed to re-structure your bucket, the solution could be costly if you don't have a specific identifier and/or reference that you can use to calculate your newest folder. Remember that in S3 you don't have a concept of folders; you only have object keys (this is where you see the / and what the AWS console visualizes as folders). So, calculating the highest incremental id in $folder_which_I_want_to_pick_latest will eventually check all the objects stored in the bucket, and every object request in S3 costs. More info: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html.
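If you do end up listing, a rough sketch in Scala of picking the latest "folder" (assuming a SparkSession named spark, that the trailing part of each folder name is a comparable timestamp, and that the bucket path and data format are only illustrative):
// Sketch: list the "folders" under the prefix and pick the newest one.
import org.apache.hadoop.fs.{FileSystem, Path}
val prefix = "s3://s3bucket/folder1/morefolders/"
val fs = FileSystem.get(new java.net.URI(prefix), spark.sparkContext.hadoopConfiguration)
val latest = fs.listStatus(new Path(prefix))
  .filter(_.isDirectory)
  .map(_.getPath.toString)
  // folder names look like randomnumber_timestamp, so compare on the trailing timestamp part;
  // this assumes the timestamps share a fixed, lexicographically sortable format
  .maxBy(name => name.substring(name.lastIndexOf('_') + 1))
val df = spark.read.parquet(latest)   // or .csv / .json, depending on the data
Keep in mind the cost caveat above: every listing is an S3 request, so this is only reasonable when the prefix holds a manageable number of folders.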
Here's one option. Consider writing a Lambda function that either runs on a schedule (say, if you know that your uploads always happen between 1 pm and 4 pm) or is triggered by an S3 object upload (so it runs for every object uploaded to folder1/morefolders/).
The Lambda would write the relevant part(s) of the S3 object prefix into a simple DynamoDB table. The client that needs to know the latest prefix would read it from DynamoDB.
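A minimal sketch of such a Lambda in Scala (the table name, key names and the prefix-extraction logic are assumptions for illustration, not part of the answer above):
// Sketch: an S3-triggered Lambda that records the newest prefix in DynamoDB.
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.S3Event
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import scala.collection.JavaConverters._
class LatestPrefixRecorder extends RequestHandler[S3Event, Void] {
  private val ddb = AmazonDynamoDBClientBuilder.defaultClient()
  override def handleRequest(event: S3Event, context: Context): Void = {
    event.getRecords.asScala.foreach { rec =>
      val key = rec.getS3.getObject.getKey                 // e.g. folder1/morefolders/<latest>/file
      val prefix = key.split("/").take(3).mkString("/")    // keep everything up to the "latest" folder
      ddb.putItem("latest_prefix_table",
        Map("id" -> new AttributeValue("latest"),
            "prefix" -> new AttributeValue(prefix)).asJava)
    }
    null
  }
}
The Spark job then only needs a single DynamoDB GetItem call from the driver to learn the latest prefix before reading.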

Use of spark to optimize S3 to S3 transfer

I am learning Spark/Scala and trying to experiment with the scenario below, using Scala.
Scenario: Copy multiple files from one S3 bucket folder to another S3 bucket folder.
Things done so far:
1) Use the AWS S3 SDK and Scala:
- Create a list of files from the S3 source location.
- Iterate through the list, pass the source and target S3 locations from step 1, and use the S3 API copyObject to copy each of these files to the target location (configured).
This works.
However, I am trying to understand whether, when I have a large number of files inside multiple folders, this is the most efficient way of doing it, or whether I can use Spark to parallelize the copying of the files.
The approach that I am thinking of is:
1) Use the S3 SDK to get the source paths, similar to what's explained above.
2) Create an RDD for each of the files using sc.parallelize() - something along these lines?
sc.parallelize(objs.getObjectSummaries.map(_.getKey).toList)
.flatMap { key => Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines }
3) Can I use sc.wholeTextFiles in some way to make this work?
I am not sure how to achieve this as of now.
Can you please help me understand whether I am thinking in the right direction, and whether this approach is correct?
Thanks
I don't think AWS made this complicated, though.
We had the same problem; we transferred around 2 TB in close to 10 minutes.
If you want to transfer from one bucket to another, it is better to use the built-in functionality to transfer within S3 itself.
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
AWS CLI Command Example:
aws s3 sync s3://sourcebucket s3://destinationbucket
If you want to do it programmatically, you can use any of the SDKs to invoke the same type of command. I would avoid reinventing the wheel.
Hope it helps.
I have a code snippet, cloudCp, which uses Spark for a high-performance parallelised upload; it would be similar to do something for copy, where you would drop down to the AWS library for that operation.
But: you may not need to push the work out to many machines. Each of the PUT/x-copy-source calls may be slow, but it doesn't use any bandwidth. You could just start a single process with many, many threads and a large HTTP client pool and run them all in that process. Take the list, sort by the largest few first, and then shuffle the rest at random to reduce throttling effects. Print out counters to help with profiling...
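If you do want to fan the copies out with Spark anyway, a rough sketch in Scala (assuming the AWS SDK for Java v1 is on the classpath; the bucket names, prefix and slice count are illustrative) could look like this:
// Sketch: parallelise server-side copies from Spark executors.
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._
val srcBucket = "source-bucket"
val dstBucket = "destination-bucket"
// List keys on the driver (paginate for listings larger than 1000 objects).
val s3 = AmazonS3ClientBuilder.defaultClient()
val keys = s3.listObjects(srcBucket, "folder/").getObjectSummaries.asScala.map(_.getKey)
sc.parallelize(keys, numSlices = 32).foreachPartition { part =>
  // One client per partition; copyObject is a server-side copy, so no data flows through the executors.
  val client = AmazonS3ClientBuilder.defaultClient()
  part.foreach(key => client.copyObject(srcBucket, key, dstBucket, key))
}
Because copyObject is a server-side copy, the executors only issue API calls; the benefit of Spark here is purely in parallelising many small requests, exactly as the answer above notes.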

How to automatically edit over 100k files on GCS using Dataflow?

I have over 100 thousand files on Google Cloud Storage that contain JSON objects, and I'd like to create a mirror maintaining the filesystem structure, but with some fields removed from the content of the files.
I tried to use Apache Beam on Google Cloud Dataflow, but it splits all the files and I can't maintain the structure anymore. I'm using TextIO.
The structure I have is something like reports/YYYY/MM/DD/<filename>,
but Dataflow outputs to output_dir/records-*-of-*.
How can I make Dataflow not split the files, and output them with the same directory and file structure?
Alternatively, is there a better system for doing this kind of edit on a large number of files?
You cannot directly use TextIO for this, but Beam 2.2.0 will include a feature that will help you write this pipeline yourself.
If you can build a snapshot of Beam at HEAD, you can already use this feature. Note: the API may change slightly between the time of writing this answer and the release of Beam 2.2.0.
- Use Match.filepatterns() to create a PCollection<Metadata> of files matching the filepattern.
- Map over the PCollection<Metadata> with a ParDo that does what you want to each file, using FileSystems:
  - Use the FileSystems.open() API to read the input file, then standard Java utilities for working with the ReadableByteChannel.
  - Use the FileSystems.create() API to write the output file.
Note that Match is a very simple PTransform (which uses FileSystems under the hood), and another way to use it in your project is to copy-paste (the necessary parts of) its code into your project, or to study its code and reimplement something similar. This can be an option if you're hesitant to update your Beam SDK version.
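For illustration, a compressed sketch of the per-file ParDo, written in Scala against the Beam Java SDK. The bucket names and the stripFields helper are hypothetical, and the exact matching API may differ slightly depending on the Beam version you end up on:
// Sketch: one DoFn invocation per matched file, reading and writing via FileSystems.
import java.nio.ByteBuffer
import java.nio.channels.Channels
import java.nio.charset.StandardCharsets
import org.apache.beam.sdk.io.FileSystems
import org.apache.beam.sdk.io.fs.MatchResult.Metadata
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
class MirrorFileFn extends DoFn[Metadata, Void] {
  @ProcessElement
  def process(c: DoFn[Metadata, Void]#ProcessContext): Unit = {
    val in = c.element().resourceId()
    // Read the whole input file through FileSystems.open().
    val content = scala.io.Source
      .fromInputStream(Channels.newInputStream(FileSystems.open(in)), "UTF-8")
      .mkString
    val edited = stripFields(content)   // hypothetical helper: drop the unwanted JSON fields
    // Mirror the path by swapping the bucket/prefix, keeping reports/YYYY/MM/DD/<filename>.
    val outSpec = in.toString.replace("gs://input-bucket/", "gs://output-bucket/")
    val out = FileSystems.matchNewResource(outSpec, false /* isDirectory */)
    // Write the edited content through FileSystems.create().
    val channel = FileSystems.create(out, "application/json")
    try channel.write(ByteBuffer.wrap(edited.getBytes(StandardCharsets.UTF_8)))
    finally channel.close()
  }
  // Placeholder: real field removal would parse and re-serialise the JSON.
  private def stripFields(json: String): String = json
}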

spark save simple string to text file

I have a Spark job that needs to store the last time it ran in a text file.
This has to work both on HDFS and on the local fs (for testing).
However, it seems that this is not nearly as straightforward as it looks.
I have tried deleting the directory and got "can't delete" error messages.
I have tried storing a simple string value in a DataFrame, writing it to Parquet, and reading it back again.
This is all so convoluted that it made me take a step back.
What's the best way to just store a string (the timestamp of the last execution, in my case) to a file by overwriting it?
EDIT:
The nasty way I do it now is as follows:
sqlc.read.parquet(lastExecution).map(t => "" + t(0)).collect()(0)
and
sc.parallelize(List(lastExecution)).repartition(1).toDF().write.mode(SaveMode.Overwrite).save(tsDir)
This sounds like storing simple application/execution metadata. As such, saving a text file shouldn't be done by "Spark" (i.e., it shouldn't be done in distributed Spark jobs, by workers).
The ideal place to put it is in your driver code, typically after constructing your RDDs. That being said, you wouldn't be using the Spark API to do this; you'd rather be doing something as trivial as using a writer or a file output stream. The only catch here is how you'll read it back. Assuming that your driver program runs on the same computer, there shouldn't be a problem.
If this value is to be read by workers in future jobs (which is possibly why you want it in HDFS), and you don't want to use the Hadoop API directly, then you will have to ensure that you have only one partition, so that you don't end up with multiple files containing the trivial value. This, however, cannot be said for local storage (the file gets stored on the machine where the worker executing the task is running); managing that would simply be going overboard.
The best option would be to use the driver program and create the file on the machine running the driver (assuming it is the same one that will be used next time), or, even better, to put the value in a database. If the value is needed in jobs, then the driver can simply pass it along.
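If you do want the file on HDFS (or the local fs) rather than the driver's plain filesystem, a small sketch of doing it from the driver with the Hadoop FileSystem API, which works for both hdfs:// and file:// paths (the path is illustrative):
// Sketch: write/read a tiny marker file from the driver, no Spark job involved.
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}
def writeLastExecution(pathStr: String, timestamp: String): Unit = {
  val path = new Path(pathStr)
  val fs = path.getFileSystem(sc.hadoopConfiguration)
  val out = fs.create(path, true)            // overwrite = true
  try out.write(timestamp.getBytes(StandardCharsets.UTF_8)) finally out.close()
}
def readLastExecution(pathStr: String): String = {
  val path = new Path(pathStr)
  val fs = path.getFileSystem(sc.hadoopConfiguration)
  val in = fs.open(path)
  try scala.io.Source.fromInputStream(in, "UTF-8").mkString finally in.close()
}
// Usage (path is an assumption):
// writeLastExecution("hdfs:///metadata/last_execution.txt", System.currentTimeMillis.toString)
// val last = readLastExecution("hdfs:///metadata/last_execution.txt")
This avoids the partitioning and overwrite issues entirely, because it never goes through a distributed write.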

How to make MapReduce work with HDFS

This might sound like a stupid question.
I might write MR code that takes its input and output as HDFS locations, and then I really don't need to worry about the parallel computing power of Hadoop/MR. (Please correct me if I am wrong here.)
However, if my input is not an HDFS location - say I am taking MongoDB data as input (mongodb://localhost:27017/mongo_hadoop.messages), running my mappers and reducers, and storing the data back to MongoDB - how does HDFS come into the picture? I mean, how can I be sure that a 1 GB (or any size) big file is first distributed on HDFS and that parallel computing is then done on it?
Is it that this direct URI will not distribute the data, so I need to take the BSON file instead, load it onto HDFS, and then give the HDFS path as input to MR? Or is the framework smart enough to do this by itself?
I am sorry if the above question is too stupid or doesn't make any sense at all. I am really new to big data, but very excited to dive into this domain.
Thanks.
You are describing DBInputFormat. This is an input format that reads splits from an external database. HDFS only gets involved in setting up the job, not in the actual input. There is also a DBOutputFormat. With an input format like DBInputFormat, the splits are logical, e.g. key ranges.
Read "Database Access with Apache Hadoop" for a detailed explanation.
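For reference, a rough sketch of wiring DBInputFormat into a job, written in Scala against the Hadoop MapReduce API; the JDBC URL, table, fields and the MessageRecord record class are illustrative assumptions:
// Sketch: read rows from a JDBC table as map input; splits are logical key ranges, not HDFS blocks.
import java.io.{DataInput, DataOutput}
import java.sql.{PreparedStatement, ResultSet}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.db.{DBConfiguration, DBInputFormat, DBWritable}
// Minimal record class: Hadoop materialises each row into one of these.
class MessageRecord extends Writable with DBWritable {
  var id: Long = 0L
  var body: String = ""
  override def write(out: DataOutput): Unit = { out.writeLong(id); out.writeUTF(body) }
  override def readFields(in: DataInput): Unit = { id = in.readLong(); body = in.readUTF() }
  override def write(st: PreparedStatement): Unit = { st.setLong(1, id); st.setString(2, body) }
  override def readFields(rs: ResultSet): Unit = { id = rs.getLong("id"); body = rs.getString("body") }
}
object DbInputExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    DBConfiguration.configureDB(conf,
      "com.mysql.jdbc.Driver",                       // JDBC driver class
      "jdbc:mysql://localhost:3306/mydb", "user", "password")
    val job = Job.getInstance(conf, "db-input-example")
    job.setInputFormatClass(classOf[DBInputFormat[MessageRecord]])
    DBInputFormat.setInput(job, classOf[MessageRecord],
      "messages",       // table
      null,             // WHERE conditions
      "id",             // ORDER BY
      "id", "body")     // fields to read
    // ... set mapper/reducer, output format and path, then job.waitForCompletion(true)
  }
}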
Sorry, I am not sure about MongoDB.
If you just want to know how splitting happens when the data source is a table, then this is my answer for the case of MapReduce working with HBase.
We use TableInputFormat to read an HBase table in a MapReduce job.
From http://hbase.apache.org/book.html#hbase.mapreduce.classpath:
7.7. Map-Task Splitting
7.7.1. The Default HBase MapReduce Splitter
When TableInputFormat is used to source an HBase table in a MapReduce job, its splitter will make a map task for each region of the table. Thus, if there are 100 regions in the table, there will be 100 map-tasks for the job - regardless of how many column families are selected in the Scan.
7.7.2. Custom Splitters
For those interested in implementing custom splitters, see the method getSplits in TableInputFormatBase. That is where the logic for map-task assignment resides.
This is a good question, not stupid.
1.
"mongodb://localhost:27017/mongo_hadoop.messages ... running my mappers and reducers, and storing the data back to MongoDB - how does HDFS come into the picture?"
In this situation, you needn't consider HDFS at all; you needn't do anything related to HDFS. It is just like writing a multi-threaded application where each thread writes data to MongoDB.
In fact, HDFS is independent of MapReduce, and MapReduce is also independent of HDFS. So you can use them separately or together, as you wish.
2.
If you want to use a database as input/output for MapReduce, you should consider DBInputFormat, but that's another question.
Right now, Hadoop's DBInputFormat only supports JDBC. I'm not sure whether there is a MongoDB version of DBInputFormat. Maybe you can search for it or implement it yourself.