I am trying to upload a file of around 7 GB to Google Cloud Storage. I used the HttpRequest class and chose the "resumable" upload type. I also set the read timeout to 20000000.
Smaller files upload fine, but for a bigger file, such as 6 GB, it returns:
Exception in thread "main" com.google.api.client.http.HttpResponseException: 400 Bad Request
Request is too large.
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1050)
Has anyone successfully uploaded a larger file, around 10 GB, to Google Cloud Storage?
How did you do it?
Thanks a lot!
Even though you have enabled resumable upload, HttpRequest.execute() is attempting to upload the full file in a single request.
You have to create separate HttpRequests, each sending one chunk of the file (a multiple of 256 KB); see the sketch after the quoted documentation below.
Resumable Upload Documentation
Chunk size restriction: All chunks must be a multiple of 256 KB (256 x 1024 bytes) in size, except for the final chunk that completes the upload. If you use chunking, it is important to keep the chunk size as large as possible to keep the upload efficient.
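For illustration, here is a rough sketch of what those per-chunk requests look like on the wire, following the documented resumable upload protocol (the session URI, chunk size and total size are placeholders for your own values). Each chunk is its own PUT to the upload session URI with a Content-Range header, and the server answers 308 until the final chunk completes the upload:

    PUT <session URI returned by the initial uploadType=resumable POST> HTTP/1.1
    Content-Length: 8388608
    Content-Range: bytes 0-8388607/7516192768        <- first 8 MiB chunk (a multiple of 256 KB)

    <- HTTP/1.1 308 Resume Incomplete
       Range: bytes=0-8388607

    PUT <same session URI> HTTP/1.1
    Content-Length: 8388608
    Content-Range: bytes 8388608-16777215/7516192768

    ... and so on; the last chunk gets a 200/201 response instead of 308.

With the google-api-client library you can build each of these as its own HttpRequest (setting the Content-Range header per request), or let MediaHttpUploader do the chunking for you.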
I am reading this article. In the section "Set your optimal fetch size" it says: "To demonstrate this, let’s take an 8MB file, and fetch it through different sized chunks. In the graph below, we can see that as the block size gets larger, performance improves. As the chunk size decreases, the overhead per-transaction increases, and performance slows down."
However, when I check the Google Cloud Storage settings there is no "block size", "chunk size" or "fetch size". Where is such a setting?
Part of the answer is in the article itself, in this parameter:
GSUtil:sliced_object_download_max_components=8
It allows you to download an object in a number of slices. In the given example it is applied to an 8 MB file, so you can deduce the size of each chunk. The exact chunk size isn't the point of the article; what matters is the HTTP overhead it implies: the more chunks you have, the more HTTP requests you make, and therefore the more HTTP overhead (in time) you pay. An example of where to set the parameter is shown below.
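If you are using gsutil, this is a boto configuration option rather than a bucket setting. A minimal sketch, assuming the default ~/.boto location (the value 8 is just an example):

    # ~/.boto
    [GSUtil]
    sliced_object_download_max_components = 8

You can also pass it per command with the -o flag, e.g. gsutil -o "GSUtil:sliced_object_download_max_components=8" cp gs://your-bucket/big-file . (the bucket and object names are placeholders).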
I have uploaded a big file of more than 5 GB, but I want to check whether this big file is counted as a single PUT operation or as multiple PUT operations based on the number of small chunks GCP divided the file into.
I am not able to find any such monitoring on the GCP console. Kindly help/guide me on how to check the number of Class A or Class B operations performed so far in GCP.
I'm using a Cloud Storage bucket mounted to a VM instance with gcsfuse. I have no problems opening and reading files stored in the bucket, but when I try to write files to it, it is enormously slow, and when I say 'enormously' I mean at least 10 times slower, if not 100 times. Is it supposed to be that way? If so, I guess I'm going to have to write files to a persistent disk, then upload them to the storage bucket, then download them to my personal computer from the storage bucket. Although the process will take the same amount of time, at least the psychological demoralization will not occur.
From Documentation:
Performance: Cloud Storage FUSE has much higher latency than a local file system. As such, throughput may be reduced when reading or writing one small file at a time. Using larger files and/or transferring multiple files at a time will help to increase throughput.
Individual I/O streams run approximately as fast as gsutil.
The gsutil rsync command can be particularly affected by latency because it reads and writes one file at a time. Using the top-level -m flag with the command is often faster.
Small random reads are slow due to latency to first byte (don't run a database over Cloud Storage FUSE!)
Random writes are done by reading in the whole blob, editing it locally, and writing the whole modified blob back to Cloud Storage. Small writes to large files work as expected, but are slow and expensive.
Optionally, check out the gsutil tool, the GCS client libraries, or even the Storage Transfer Service, since they may suit your needs better depending on your specific use case.
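For example, a plain gsutil copy from the VM avoids the FUSE layer entirely, and the top-level -m flag parallelizes the transfer (the bucket and paths below are placeholders):

    # copy results out of the VM with parallel workers; bucket/paths are examples
    gsutil -m cp -r /mnt/disks/output gs://your-bucket/output/

    # or keep a local directory and the bucket in sync
    gsutil -m rsync -r /mnt/disks/output gs://your-bucket/output/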
I hope this clarifies your concerns.
So, I have around 35 GB of zip files, each containing 15 CSV files. I have created a Scala script that processes each zip file and each CSV file inside each zip file.
The problem is that after some number of files the script throws this error:
ERROR Executor: Exception in task 0.0 in stage 114.0 (TID 3145)
java.io.IOException: java.sql.BatchUpdateException: (Server=localhost/127.0.0.1[1528] Thread=pool-3-thread-63) XCL54.T : [0] insert of keys [7243901, 7243902,
And the string continues with all the keys (records) that were not inserted.
So what I have found is that apparently (I say apparently because of my lack of knowledge of Scala, SnappyData and Spark) the memory being used is full... My question: how do I increase the amount of memory available? Or how do I spill the data that is in memory to disk?
Can I close the session that was started and free the memory that way?
I have had to restart the server and remove the files already processed, and then I can continue with the import, but after some more files... again... the same exception.
My CSV files are big... the biggest one is around 1 GB, but this exception happens not only with the big files but also when multiple files accumulate... until some size is reached... so where do I change that memory limit?
I have 12 GB of RAM...
You can use RDD persistence to store data to disk, memory, or a combination: https://spark.apache.org/docs/2.1.0/programming-guide.html#rdd-persistence
Also, try using a larger number of partitions when reading the file(s), for example sc.textFile(path, 200000); a short sketch of both is shown below.
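A minimal sketch of those two suggestions, assuming an existing SparkContext sc and a hypothetical input path:

    import org.apache.spark.storage.StorageLevel

    // read with a large minimum number of partitions so each task handles a small slice
    val lines = sc.textFile("/data/zips/extracted/*.csv", 200000)

    // MEMORY_AND_DISK spills partitions that do not fit in memory to disk
    // instead of recomputing them or running out of memory
    lines.persist(StorageLevel.MEMORY_AND_DISK)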
I think you are running out of available memory. The exception message is misleading. If you only have 12 GB of memory on your machine, I wonder whether your data will fit at all.
What I would do first is figure out how much memory you need.
1. Copy conf/servers.template to conf/servers.
2. Edit that file with something like: localhost -heap-size=3g -memory-size=6g. This essentially allocates 3g in your server for computations (Spark, etc.) and 6g of off-heap memory for your data (column tables only). A sketch of the file is shown after this list.
3. Start your cluster using snappy-start-all.sh.
4. Load some subset of your data (I doubt you have enough memory for all of it).
5. Check the memory used in the SnappyData Pulse UI (localhost:5050).
6. If you think you have enough memory, load the full data.
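For reference, a sketch of what conf/servers could look like with those settings (hostname and sizes are examples; tune them to your 12 GB machine):

    # conf/servers -- one line per server host
    # 3g of heap for compute (Spark) + 6g of off-heap memory for column table data
    localhost -heap-size=3g -memory-size=6g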
Hopefully that works out.
BatchUpdateException tells me that you are creating Snappy tables and inserting data into them. Also, a BatchUpdateException in most cases means low memory (the exception message needs to be better). So I believe you may be right about the memory. To free the memory, you will have to drop the tables that you created. For information about memory sizing and table sizing, you may want to read these docs:
http://snappydatainc.github.io/snappydata/best_practices/capacity_planning/#memory-management-heap-and-off-heap
http://snappydatainc.github.io/snappydata/best_practices/capacity_planning/#table-memory-requirements
Also, if you have a lot of data that can't fit in memory, you can overflow it to disk. See the following doc about the overflow configuration (a rough sketch follows the link):
http://snappydatainc.github.io/snappydata/best_practices/design_schema/#overflow-configuration
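As a rough sketch only (assuming a SnappySession named snappy and a hypothetical table; check the doc above for the exact options supported by your version), a column table can be created with eviction and overflow enabled so that cold data is written to disk instead of filling the heap:

    // hypothetical table; EVICTION_BY/OVERFLOW options per the SnappyData docs linked above
    snappy.sql("""
      CREATE TABLE staging_rows (id BIGINT, payload STRING)
      USING column
      OPTIONS (EVICTION_BY 'LRUHEAPPERCENT', OVERFLOW 'true')
    """)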
Hope it helps.
Unfortunately I could not find this in the specification (BEP 9).
Is it possible to download the torrent metadata from several peers, or is it restricted to only one peer?
For example, could I download the first chunk of the torrent file from one peer and the second chunk from another peer?
Thank you in advance.
BEP 9 is here.
Yes, it is possible to download the metadata from several peers, assuming the metadata (the info-dictionary specifically) is larger than 16 KiB. The info-dictionary is split up into 16 KiB blocks, and each block is requested by specifying its index. By requesting different blocks from different peers, you can download it from multiple peers in parallel.
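For illustration, this is what a BEP 9 request for one block looks like: after the extension handshake advertising ut_metadata, each request is an extension message whose payload is a small bencoded dictionary (msg_type 0 = request, 1 = data, 2 = reject):

    { "msg_type": 0, "piece": 1 }     request block index 1 of the info-dictionary
    d8:msg_typei0e5:piecei1ee         the same message as it appears bencoded on the wire

Send one such request to one peer for piece 0, another peer for piece 1, and so on; each data reply also carries the metadata's total_size, so you know how many 16 KiB blocks to expect.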