Data Factory Copy Activity Blob -> ADLS - azure-data-factory

I have files that accumulate in Blob Storage on Azure that are moved each hour to ADLS with data factory... there are around 1000 files per hour, and they are 10 to 60kb per file...
what is the best combination of:
"parallelCopies": ?
"cloudDataMovementUnits": ?
and also,
"concurrency": ?
to use?
currently i have all of these set to 10, and each hourly slice takes around 5 minutes, which seems slow?
could ADLS, or Blob be getting throttled, how can i tell?

There won't be a one solution fits all scenarios when it comes to optimizing a copy activity. However there few things you can checkout and find a balance. A lot of it depends on the pricing tiers / type of data being copied / type of source and sink.
I am pretty sure that you would have come across this article.
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
this is a reference performance sheet, the values are definitely different depending on the pricing tiers of your source and destination items.
Parallel Copy :
This happens at the file level, so it is beneficial if your source files are big as it chunks the data (from the article)
Copy data between file-based stores Between 1 and 32. Depends on the size of the files and the number of cloud data movement units (DMUs) used to copy data between two cloud data stores, or the physical configuration of the Self-hosted Integration Runtime machine.
The default value is 4.
behavior of the copy is important. if it is set to mergeFile then parallel copy is not used.
Concurrency :
This is simply how many instances of the same activity you can run in parallel.
Other considerations :
Compression :
Codec
Level
Bottom line is that you can pick and choose the compression, faster compression will increase network traffic, slower will increase time consumed.
Region :
the location or region of that the data factory, source and destination might affect performance and specially the cost of the operation. having them in the same region might not be feasible all the time depending on your business requirement, but definitely something you can explore.
Specific to Blobs
https://learn.microsoft.com/en-us/azure/storage/common/storage-performance-checklist#blobs
this article gives you a good number of metrics to improve performance, however when using data factory i don't think there is much you can do at this level. You can use the application monitoring to check out throughput while your copy is going on.

Related

Google Cloud cloud storage operation costs

I am looking into using Google Cloud cloud storage buckets as a cheaper alternative to compute engine snapshots to store backups.
However, I am a bit confused about the costs per operation. Specifically the insert operation. If I understand the documentation correctly, it doesn't seem that it matters how large the file is that you want to insert is, it always counts as 1 operation.
So if I upload a single 20 TB file using one insert to a standard storage class bucket, wait 14 days, then retrieve it again, and all this within the same region, I practically only pay for storing it for 14 days?
Doesn't that mean that even the standard storage class bucket is a more cost effective option for storing backups compared to snapshots, as long as you can get your whole thing into a single file?
It's not fully accurate, and all depends on what cost for you.
First of all, the maximum size of an object in Cloud Storage is 5 TiB, so you can't store 1 file of 20Tb, but 4, at the end, it's the same principle.
The persistent disk snapshot is a very powerful feature:
The snapshot doesn't need CPUs to be done, compared to your solution.
The snapshot doesn't need network bandwidth to be done, compared to your solution.
The snapshot can be done anytime, on the fly.
The snapshot can be restored in the current VM, or you can create a new VM with a snapshot to investigate on it, for example.
You can perform incremental snapshots saving money (cheaper than full image snapshot).
You don't need additional space on your persistent disk to be done (compared to your solution where you need to create an archive before sending it to Cloud Storage).
In your scenario seems like using snapshots seems like the best solution in terms of time efficiency. Now, is using Cloud Storage a cheaper solution? Probably, as it is listed as the most affordable storage option, but in the end, you will have to calculate the cost-benefits on your own.

Is MongoDB a good choice for storing a huge set of text files?

I'm currently building a system (with GCP) for storing large set of text files of different sizes (1kb~100mb) about different subjects. One fileset could be more than 10GB.
For example:
dataset_about_some_subject/
- file1.txt
- file2.txt
...
dataset_about_another_subject/
- file1.txt
- file2.txt
...
The files are for NLP, and after pre-processing, as pre-processed data are saved separately, will not be accessed frequently. So saving all files in MongoDB seems unnecessary.
I'm considering
saving all files into some cloud storage,
save file information like name and path to MongoDB as JSON.
The above folders turn to:
{
name: dataset_about_some_subject,
path: path_to_cloud_storage,
files: [
{
name: file1.txt
...
},
...
]
}
When any fileset is needed, search its name in MongoDB and read the files from cloud storage.
Is this a valid way? Will there be any I/O speed problem?
Or is there any better solution for this?
And I've read about Hadoop. Maybe this is a better solution?
Or maybe not. My data is not that big.
As far as I remember, MongoDB has a maximum object size of 16 MB, which is below the maximum size of the files (100 MB). This means that, unless one splits, storing the original files in plaintext JSON strings would not work.
The approach you describe, however, is sensible. Storing the files on cloud storage such as S3 or Azure, is common, not very expensive, and does not require a lot of maintenance comparing to having your own HDFS cluster. I/O would be best by performing the computations on the machines of the same provider, and making sure the machines are in the same region as the data.
Note that document stores, in general, are very good at handling large collections of small documents. Retrieving file metadata in the collection would thus be most efficient if you store the metadata of each file in a separate object (rather than in an array of objects in the same document), and have a corresponding index for fast lookup.
Finally, there is another aspect to consider, namely, whether your NLP scenario will process the files by scanning them (reading them all entirely) or whether you need random access or lookup (for example, a certain word). In the first case, which is throughput-driven, cloud storage is a very good option. In the latter case, which is latency-driven, there are document stores like Elasticsearch that offer good fulltext search functionality and can index text out of the box.
I recommend you to store large file using storage service provide by below. It also support Multi-regional access through CDN to ensure the speed of file access.
AWS S3: https://aws.amazon.com/tw/s3/
Azure Blob: https://azure.microsoft.com/zh-tw/pricing/details/storage/blobs/
GCP Cloud Storage: https://cloud.google.com/storage
You can rest assured that for the metadata storage you propose in mongodb, speed will not be a problem.
However, for storing the files themselves, you have various options to consider:
Cloud storage: fast setup, low initial cost, medium cost over time (compare vendor prices), datatransfer over public network for every access (might be a performance problem)
Mongodb-Gridfs: already in place, operation cost varies, data transfer is just as fast as from mongo itself
Hadoop cluster: high initial hardware and setup cost, lower cost over time. Data transfer in local network (provided you build it on-premise.) Specialized administration skills needed. Possibility to use the cluster for parrallel calculations (i.e. this is not only storage, this is also computing power.) (As a rule of thumb: if you are not going to store more than 500 TB, this is not worthwile.)
If you are not sure about the amount of data you cover, and just want to get started, I recommend starting out with gridfs, but encapsulate in a way that you can easily exchange the storage.
I have another answer: as you say, 10GB is really not big at all. You may want to also consider the option of storing it on your local computer (or locally on one single machine in the cloud), simply on your regular file system, and executing in parallel on your cores (Hadoop, Spark will do this too).
One way of doing it is to save the metadata as a single large text file (or JSON Lines, Parquet, CSV...), the metadata for each file on a separate line, then have Hadoop or Spark parallelize over this metadata file, and thus process the actual files in parallel.
Depending on your use case, this might turn out to be faster than on a cluster, or not exceedingly slower, especially if your execution is CPU-heavy. A cluster has clear benefits when the problem is that you cannot read from the disk fast enough, and for workloads executed occasionally, this is a problem that one starts having from the TB range.
I recommend this excellent paper by Frank McSherry:
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

Why Parquet over some RDBMS like Postgres

I'm working to build a data architecture for my company. A simple ETL with internal and external data with the aim to build static dashboard and other to search trend.
I try to think about every step of the ETL process one by one and now I'm questioning about the Load part.
I plan to use Spark (LocalExcecutor on dev and a service on Azure for production) so I started to think about using Parquet into a Blob service. I know all the advantage of Parquet over CSV or other storage format and I really love this piece of technology. Most of the articles I read about Spark finish with a df.write.parquet(...).
But I cannot figure it out why can I just start a Postgres and save everything here. I understand that we are not producing 100Go per day of data but I want to build something future proof in a fast growing company that gonna produce exponentially data by the business and by the logs and metrics we start recording more and more.
Any pros/cons by more experienced dev ?
EDIT : What also make me questioning this is this tweet : https://twitter.com/markmadsen/status/1044360179213651968
The main trade-off is one of cost and transactional semantics.
Using a DBMS means you can load data transactionally. You also pay for both storage and compute on an on-going basis. The storage costs for the same amount of data are going to be more expensive in a managed DBMS vs a blob store.
It is also harder to scale out processing on a DBMS (it appears the largest size Postgres server Azure offers has 64 vcpus). By storing data into an RDBMs you are likely going to run-up against IO or compute bottlenecks more quickly then you would with Spark + blob storage. However, for many datasets this might not be an issue and as the tweet points out if you can accomplish everything inside a the DB with SQL then it is a much simpler architecture.
If you store Parquet files on a blob-store, updating existing data is difficult without regenerating a large segment of your data (and I don't know the details of Azure but generally can't be done transactionally). The compute costs are separate from the storage costs.
Storing data in Hadoop using raw file formats is terribly inefficient. Parquet is a Row Columnar file format well suited for querying large amounts of data in quick time. As you said above, writing data to Parquet from Spark is pretty easy. Also writing data using a distributed processing engine (Spark) to a distributed file system (Parquet+HDFS) makes the entire flow seamless. This architecture is well suited for OLAP type data.
Postgres on the other hand is a relational database. While it is good for storing and analyzing transactional data, it cannot be scaled horizontally as easily as HDFS can be. Hence when writing/querying large amount of data from Spark to/on Postgres, the database can become a bottleneck. But if the data you are processing is OLTP type, then you can consider this architecture.
Hope this helps
One of the issues I have with a dedicated Postgres server is that it's a fixed resource that's on 24/7. If it's idle for 22 hours per day and under heavy load 2 hours per day (in particular if the hours aren't
continuous and are unpredictable)
then the server sizing during those 2 hours is going to be too low whereas during the other 22 hours it's too high.
If you store your data as parquet on Azure Data Lake Gen 2 and then use Serverless Synapse for SQL queries then you don't pay for anything on a 24/7 basis. When under heavy load, everything scales automatically.
The other benefit is that parquet files are compressed whereas Postgres doesn't store data compressed.
The downfall is "latency" (probably not the right term but it's how I think of it). If you want to query a small amount of data then, in my experience, it's slower with the file + Serverless approach compared to a well indexed clustered or partitioned Postgres table. Additionally, it's really hard to forecast your bill with the Serverless model coming from the server model. There's definitely going to be usage patterns where Serverless is going to be more expensive than a dedicated server. In particular if you do a lot of queries that have to read all or most of the data.
It's easier/faster to save a parquet than to do a lot of inserts. This is a double edged sword because the db guarantees acidity whereas saving parquet files doesn't.
Parquet storage optimization is its own task. Postgres has autovacuum. If the data you're consuming is published daily but you want it on a node/attribute/feature partition scheme then you need to do that manually (perhaps with spark pools).

Google Cloud Storage cost effectiveness for small files?

I have a lot of small unstructured json files (less than 1K each) I want to store on Google cloud storage somehow (using streaming). I would prefer to avoid putting them into zip files (I think) since I'm thinking of using Apache Drill to perform queries against them. Would it be more cost effective to merge multiple json documents together rather than storing them one by one? (I assume that writing the files in batches would be a good thing regardless if they're merged or stored separately)
Well...maybe. It depends on your usage pattern.
GCS does not have a per-object charge. Instead, it charges per Gigabyte stored per month. Breaking the files up won't affect that at all.
However, GCS also charges a per-operation fee. At time of writing, every 10,000 downloads will cost you a penny, and every 10,000 uploads will cost you a dime. If you only have a few thousand files or only access a few files at a time, this might not make a big difference, but if you need to download all of the files frequently, or if you need to replace them frequently, and you're doing millions or billions of separate uploads per day, suddenly using a few big files instead could save you a lot of money.
If you can estimate how many downloads and uploads you'll be doing under each scenario, Google provides a calculator to let you know what it will cost: https://cloud.google.com/products/calculator/

Getting large rows out of SQL Azure - but where to go? Tables, Blob or something like MongoDB?

I read through a lot of comparisons between Azure Table/Blob/SQL storage and I think I have a good understanding of all of those ... but still, I'm unsure where to go for my specific needs. Maybe someone with experience in similar scenarios and is able to make a recommendation.
What I have
A SQL Azure DB that stores articles in raw HTML inside a varchar(max) column. Each row also has many metadata columns and many indexes for easy querying. The table contains many references to Users, Subscriptions, Tags and more - so a SQL DB will always be needed for my project.
What's the problem
I already have about 500,000 articles in this table and I expect it to grow by millions of articles per year. Each article's HTML content can be anywhere between a few KB and 1 MB or, in very few cases, larger than 1 MB.
Two problems arise: as Azure SQL storage is expensive, rather earlier than later I'll shoot myself in the head with the costs for storing this. Also, I will hit the 150 GB DB size limit also rather earlier than later. Those 500,000 articles already consume 1,6 GB DB space now.
What I want
It's clear those HTML content has to get out of the SQL DB. While the article table itself has to remain for joining it to users, subscriptions, tags and more for fast relational discovery of the needed articles, at least the colum that holds the HTML content could be outsourced to a cheaper storage.
At first sight, Azure Table storage seems like the perfect fit
Terabytes of data in one large table for very cheap prices and fast queries - sounds perfect to have a singe Table Storage table holding the article contents as an add-on to the SQL DB.
But reading through comparisons here shows it might not even be an option: 64 KB per column would be enough for 98 % of my articles, but there are those 2 % left where for some single articles even the whole 1 MB of the row limit might not be enough.
Blob storage sounds completely wrong, but ...
So there's just one option on Azure left: Blobs. Now, it might not be as wrong as it sounds. In most of the cases, I would need the content of only a single article at once. This should work fine and fast enough with Blob storage.
But I also have queries where I would need 50, 100 or even more rows at once INCLUDING even the content. So I would have to run the SQL query to fetch the needed articles and then fetch every single article out of the Blob storage. I have no experience with that but I can't believe I'd be able to remain in millisecond timespan for the queries when doing that. And queries that take multiple seconds are an absolute no-go for my project.
So it also does not seem to be to be an appropriate solution.
Do I look like a guy with a plan?
At least I have something like a plan. I thought about only "exporting" appropriate records into SQL Table Storage and/or Blob Storage.
Something like "as long as the content is < 64 KB export it to table storage, else keep it in the SQL table (or even export this single XL record into BLOB storage)"
That might work good enough. But it makes things complicated and maybe unnecessary error-prone.
Those other options
There are some other NoSQL DBs like MongoDB and CouchDB that seem to better fit my needs (at least from my naive point of view as someone who just read the specs on paper, I don't have experience with them). But they'd require self-hosting, some thing I'd like to get out of it's way if possible. I'm on Azure to do as little as needed in terms of self-hosting servers and services.
Did you really read until here?
Then thank you very much for your valuable time and thinking about my problems :)
Any suggestions would be greatly appreciated. As you see, I have my ideas and plans, but nothing beats experience from someone who walked down the road before :)
Thanks,
Bernhard
I signed up just solely to help with this question. In the past, I have found useful answers to my problems from Stackoverflow - thank you community - so I thought it would just be fair (perhaps fair is an understatement) to attempt to give something back with this question, as it falls on my alley.
In short, while considering all factors stated in the question, table storage may be the best option - iif you can properly estimate transactions per month: a nice article on this.
You can solve the two limitations that you mentioned, row and column limit, by splitting (plain text method or serializing it) the document/html/data. Speaking from experience with 40 GB+ data stored in Table Storage, where frequently our app retrieves more than 10 rows per each page visit in milliseconds - no argument here! If you need 50+ rows at times, you are looking at low single digits second(s), or you can do them in parallel (and further by splitting the data in different partitions), or in some async fashion. Or, read suggested multi level caching below.
A bit more detail. I tried with SQL Azure, Blob (both page and block), and Table Storage. I can not speak for Mongo DB since, partially for the reasons already mentioned here, I did not want to go that route.
Table Storage is fast; in the range of 20-50 milliseconds, or even faster sometimes (depends, for instance in the same data center i have seen it gone as low as 10 milliseconds), when querying with partition and row key. You may also further have several partitions, in some fashion based on your data and your knowledge about it.
It scales better, in terms of GB's but not transactions
Row and column limitations that you mentioned are a burden, agreed, but not a show stopper. I have written my own solution to split entities, you can too easily, or you can see this already-written-solution (does not solve the whole problem but it is a good start): https://code.google.com/p/lokad-cloud/wiki/FatEntities
Also need to keep in mind that uploading data to table storage is time consuming, even when batching entities due to other limitations (i.e., request size less than 4 MB, upload bandwidth, etc).
But using solely just TableStorage may not be the best solution (thinking about growth and economics). The best solution that we ended up implementing used multi-level caching/storage, starting from static classes, Azure Role Based Cache, Table Storage, and Block Blobs. Lets call this, for readability purposes, level 1A, 1B, 2 and 3 respectively. Using this approach, we are using a medium single instance (2 CPU Cores and 3.5 GB Ram - my laptop has better performance), and are able to process/query/rank 100GB+ of data in seconds (95% of cases in under 1 second). I believe this is fairly impressive given that we check all "articles" before displaying them (4+ million "articles").
First, this is tricky and may or may not be possible in your case. I do not have sufficient knowledge about the data and its query/processing usage, but if you can find a way to organize the data well this may be ideal. I will make an assumption: it sounds like you are trying to search through and find relevant articles given some information about a user and some tags (a variant of a news aggregator perhaps, just got a hunch for that). This assumption is made for the sake of illustrating the suggestion, so even if not correct, I hope it will help you or trigger new ideas on how this could be adopted.
Level 1A data.
Identify and add key entities or its properties in a static class (periodically, depending on how you foresee updates). Say we identify user preferences (e.g., demographics and interest, etc) and tags (tech, politics, sports, etc). This will be used to retrieve quickly who the user is, his/her preferences, and any tags. Think of these as key/value pair; for instance key being a tag, and its value being a list of article IDs, or a range of it. This solves a small piece of a problem, and that is: given a set of keys (user pref, tags, etc) what articles are we interested in! This data should be small in size, if organized properly (e.g., instead of storing article path, you can only store a number). *Note: the problem with data persistence in a static class is that application pool in Azure, by default, resets every 20 minutes or so of inactivity, thus your data in the static class is not persistent any longer - also sharing them across instances (if you have more than 1) can become a burden. Welcome level 1B to the rescue.
Leval 1B data
A solution we used, is to keep layer 1A data in a Azure Cache, for its sole purpose to re-populate the static entity when and if needed. Level 1B data solves this problem. Also, if you face issues with application pool reset timing, you can change that programmatically. So level 1A and 1B have the same data, but one is faster than the other (close enough analogy: CPU Cache and RAM).
Discussing level 1A and 1B a bit
One may point out that it is an overkill to use a static class and cache, since it uses more memory. But, the problem we found in practice, is that, first it is faster with static. Second, in cache there are some limitations (ie., 8 MB per object). With big data, that is a small limit. By keeping data in a static class one can have larger than 8 MB objects, and store them in cache by splitting them (i.e., currently we have over 40 splits). BTW please vote to increase this limit in the next release of azure, thank you! Here is the link: www.mygreatwindowsazureidea.com/forums/34192-windows-azure-feature-voting/suggestions/3223557-azure-preview-cache-increase-max-item-size
Level 2 data
Once we get the values from the key/value entity (level 1A), we use the value to retrieve the data in Table Storage. The value should tell you what partition and Row Key you need. Problem being solved here: you only query those rows relevant to the user/search context. As you can see now, having level 1A data is to minimize row querying from table storage.
Level 3 data
Table storage data can hold a summary of your articles, or the first paragraph, or something of that nature. When it is needed to show the whole article, you will get it from Blob. Table storage, should also have a column that uniquely identifies the full article in blob. In blob you may organize the data in the following manner:
Split each article in separate files.
Group n articles in one file.
Group all articles in one file (not recommended although not as bad as the first impression one may get).
For the 1st option you would store, in table storage, the path of the article, then just grab it directly from Blob. Because of the above levels, you should need to read only a few full articles here.
For the 2nd and 3rd option you would store, in table storage, the path of the file and the start and end position from where to read and where to stop reading, using seek.
Here is a sample code in C#:
YourBlobClientWithReferenceToTheFile.Seek(TableStorageData.start, SeekOrigin.Begin);
int numBytesToRead = (int)TableStorageData.end - (int)TableStorageData.start;
int numBytesRead = 0;
while (numBytesToRead > 0)
{
int n = YourBlobClientWithReferenceToTheFile.Read(bytes,numBytesRead,numBytesToRead);
if (n == 0)
break;
numBytesRead += n;
numBytesToRead -= n;
}
I hope this didn't turn into a book, and hope it was helpful. Feel free to contact me if you have follow up questions or comments.
Thanks!
The proper storage for a file is a blob. But if your query needs to return dozens of blobs at the same time, it will be too slow as you are pointing out. So you could use a hybrid approach: use Azure Tables for 98% of your data, and if it's too large, use a Blob instead and store the Blob URI in your table.
Also, are you compressing your content at all? I sure would.
My thoughts on this: Going the MongoDB (or CouchDB) route is going to end up costing you extra Compute, as you'll need to run a few servers (for high availability). And depending on performance needed, you may end up running 2- or 4-core boxes. Three 4-core boxes is going to run more than your SQL DB costs (plus then there's the cost of storage, and MongoDB etc. will back their data in an Azure blob for duable storage).
Now, as for storing your html in blobs: this is a very common pattern, to offload large objects to blob storage. The GETs should be doable in a single call to blob storage (single transaction) especially with the file size range you mentioned. And you don't have to retrieve each blob serially; you can take advantage of TPL to download several blobs to your role instance in parallel.
One more thing: How are you using the content? If you're streaming it from your role instances, then what I said about TPL should work nicely. If, on the other hand, you're injecting href's into your output page, you can just put the blob url directly into your html page. And if you're concerned about privacy, make the blobs private and generate a short-TTL "shared access signature" granting access for a small time window (this only applies if inserting blob url's into some other html page; it doesn't apply if you're downloading to the role instance and then doing something with it there).
You could use MongoDB's GridFS feature: http://docs.mongodb.org/manual/core/gridfs/
It splits the data into 256k chunks by default (configurable up to 16mb) and lets you use the sharded database as a filesystem which you can use to store and retrieve files. If the file is larger than the chunk size, the mongo db drivers handle splitting up / re-assembling the data when the file needs to be retrieved. To add additional disk space, simply add additional shards.
You should be aware, however that only some mongodb drivers support this and it is a driver convention and not a server feature that allows for this behavior.
A few comments:
What you could do is ALWAYS store HTML content in blob storage and store the blob's URL in table storage. I personally don't like the idea of storing data conditionally i.e. if content of HTML file is more than 64 KB only then store it in blob storage otherwise use table storage. Other advantage you get out of this approach is that you can still query the data. If you store everything in blob storage, you would lose querying capability.
As far as using other NoSQL stores are concerned, only problem I see with them is that they are not natively supported on Windows Azure thus you would be responsible for managing them as well.
Another option would be to store your files as a VHD image in blob storage. Your roles can mount the VHD to their filesystem, and read the data from there.
The complication seems to be that only one VM can have read/write access to the VHD. The others can create a snapshot and read from that, but they won't see updates. Depending on how frequently your data is updated that could work. eg, if you update data at well-known times you could have all the clients unmount, take a new snapshot, and remount to get the new data.
You can also share out a VHD using SMB sharing as described in this MSDN blog post. This would allow full read/write access, but might be a little less reliable and a bit more complex.
you don't say, but if you are not compressing your articles that probably solves your issue then just use table storage.
Otherwise just use table storage and use a unique partition key for each article. If an article's too big put it in 2 rows, as long as you query by partition key you'll get both rows, then use the row key as the index indicating how the articles fit back together
One idea that i have would be to use CDN to store your article content, and link them directly from the client side, instead of any multi phase, operation of getting data from sql then going to some storage.
It would be something like
http://<cdnurl>/<container>/<articleId>.html
Infact same thing can be done with Blob storage too.
The advantage here is that this becomes insanely fast.
Disadvantage here is that security aspect is lost.
Something like Shared Access Signature can be explored for security, but I am not sure how helpful would it be for client side links.