GCS slow upload from pod inside kubernetes GKE - google-cloud-storage

Uploading to GCS from a pod inside GKE takes a really long time. I hoped the upgrade to Kubernetes 1.1 would help, but it didn't. It is faster, but not as fast as it should be. I ran some benchmarks, uploading a single 100 MiB file:
docker 1.7.2 local
took 20m51s, that's about ~0.08 MB/s
docker 1.8.3 local
took 3m51s, that's about ~0.43 MB/s
docker 1.9.0 local
took 3m51s, that's about ~0.43 MB/s
kubernetes 1.0
took 1h0m10s, that's about ~0.028 MB/s
kubernetes 1.1.2 (docker 1.8.3)
took 32m11s, that's about ~0.052 MB/s
As you can see the throughput doubles with Kubernetes 1.1.2, but it is still really slow. If I want to upload 1GB I have to wait ~5 hours, which can't be the expected behaviour. GKE runs inside the Google infrastructure, so I expect it to be faster than, or at least as fast as, uploading from local.
I also noted a very high CPU load (70%) while uploading. It was tested with an n1-highmem-4 machine type and a single RC/pod that was doing nothing but the upload.
I'm using the Java client with the GAV coordinates com.google.appengine.tools:appengine-gcs-client:0.5
The relevant code is as follows:
InputStream inputStream = ...; // 100MB RandomData from RAM
StorageObject so = new StorageObject().setContentType("text/plain").setName(objectName);
AbstractInputStreamContent content = new InputStreamContent("text/plain", inputStream);
Stopwatch watch = Stopwatch.createStarted();
storage.objects().insert(bucket.getName(), so, content).execute();
watch.stop();
Copying a 100MB file using a manually installed gcloud with gsutil cp took nearly no time (3 seconds). So it might be an issue with the Java library? The question remains: how can I improve the upload time using the Java library?

The solution is to enable "DirectUpload". So instead of writing
storage.objects().insert(bucket.getName(), so, content).execute();
you have to write:
Storage.Objects.Insert insert = storage.objects().insert(bucket.getName(), so, content);
insert.getMediaHttpUploader().setDirectUploadEnabled(true);
insert.execute();
Performance I get with this solution:
took 13s, that's about ~7.69 MB/s
JavaDoc for the setDirectUploadEnabled:
Sets whether direct media upload is enabled or disabled.
If value is set to true then a direct upload will be done where the
whole media content is uploaded in a single request. If value is set
to false then the upload uses the resumable media upload protocol to
upload in data chunks.
Direct upload is recommended if the content size falls below a certain
minimum limit. This is because there's minimum block write size for
some Google APIs, so if the resumable request fails in the space of
that first block, the client will have to restart from the beginning
anyway.
Defaults to false.
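If you do need the resumable protocol (for example for uploads too large to send comfortably in a single request), another knob worth trying is the chunk size on the same MediaHttpUploader. A minimal sketch, reusing the storage, bucket, so and content objects from above; the 8 MiB chunk size is only an example value, not something from the original post:
// com.google.api.client.googleapis.media.MediaHttpUploader
Storage.Objects.Insert insert = storage.objects().insert(bucket.getName(), so, content);
MediaHttpUploader uploader = insert.getMediaHttpUploader();
// keep the resumable protocol, but send larger chunks per HTTP request;
// the value must be a positive multiple of MediaHttpUploader.MINIMUM_CHUNK_SIZE (256 KiB)
uploader.setChunkSize(32 * MediaHttpUploader.MINIMUM_CHUNK_SIZE); // 8 MiB per request
insert.execute();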

The fact that you're seeing high CPU load and that the slowness only affects Java and not the Python gsutil is consistent with the slow AES GCM issue in Java 8. The issue is fixed in Java 9 using appropriate specialized CPU instructions.
If you have control over it, then either using Java 7 or adding jdk.tls.disabledAlgorithms=SSLv3,GCM to a file passed to java -Djava.security.properties should fix the slowness as explained in this answer to the general slow AES GCM question.
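For example, a minimal sketch (the properties file path and jar name are placeholders, not from the original answer):
# contents of e.g. /opt/app/disable-gcm.properties
jdk.tls.disabledAlgorithms=SSLv3,GCM
# then point the JVM that performs the upload at it
java -Djava.security.properties=/opt/app/disable-gcm.properties -jar your-uploader.jar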

Related

Aspera Node API /files/{id}/files endpoint not returning up to date data

I am working on a webapp for transferring files with Aspera. We are using AoC for the transfer server and an S3 bucket for storage.
When I upload a file to my S3 bucket using Aspera Connect everything appears to be successful: I see it in the bucket, and I see the new file in the directory when I run /files/browse on the parent folder.
I am refactoring my code to use the /files/{id}/files endpoint to list the directory because the documentation says it is faster compared to the /files/browse endpoint. After the upload is complete, when I run the /files/{id}/files GET request, the new file does not show up in the returned data right away. It only becomes available after a few minutes.
Is there some caching mechanism in place? I can't find anything about this in the documentation. When I make a transfer in the AoC dashboard everything updates right away.
Thanks,
Tim
Yes, the file-id based system uses an in-memory cache (Redis).
This cache is updated when a new file is uploaded using Aspera. But for file movements made directly on the storage, there is a daemon that periodically scans and finds new files.
If you want to bypass the cache, and have the API read the storage, you can add this header in the request:
X-Aspera-Cache-Control: no-cache
Another possibility is to trigger a scan by reading /files/{id} for the folder id.
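For example, with curl (a sketch; the node host, access token and file id are placeholders):
curl -H "Authorization: Bearer $TOKEN" -H "X-Aspera-Cache-Control: no-cache" "https://node.example.com/files/123/files"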

What is the "[full path]" component of the SSL Certificate Authority given by MySQL and PostgreSQL (boto3) calls in the AWS docs?

In the AWS documentation for "Connecting to your DB instance using IAM authentication and the AWS SDK for Python (Boto3)", the following call is made to both psycopg2.connect (shown) and mysql.connector.connect:
conn = psycopg2.connect(host=ENDPOINT, port=PORT, database=DBNAME, user=USR, password=token, sslmode='prefer', sslrootcert="[full path]rds-combined-ca-bundle.pem")
cur = conn.cursor()
cur.execute("""SELECT now()""")
query_results = cur.fetchall()
print(query_results)
I see some discussion about the ssl_ca path (here and here) and what those bundles are used for. But none of the three links I've given here describe the [full path] component given by the AWS docs, or where it is pointing to. My current guess (from the second link) is this URL, but I'd like to be sure.
Additionally, what are the advantages to having this bundle downloaded to the remote EC2 on which these Python 3 (boto3) scripts are running?
EDIT: By the way, the above call to psycopg2.connect is working in Jupyter with Python 3.9.5 on an EC2 currently, with the [full path] written as-is...
You should replace '[full path]' with the filesystem path (directory path) to where you saved the .pem file when you downloaded it (from that last URL you gave) to the local computer.
The advantage of using it is that your client will verify it connected to the correct database, and not some malicious system which is intercepting your traffic. I don't know how advantageous you consider this: if someone has compromised Amazon enough to be intercepting their internal traffic, they might also have compromised their CA as well. But there is at least some possibility they did one without the other.
Your code as shown does not work for me, because ssl_ca is not how it is spelled. Assuming you used the code actually given at your first link for PostgreSQL:
sslmode='prefer', sslrootcert="[full path]rds-combined-ca-bundle.pem"
Then the reason it works despite the bogus path is that "prefer" means it doesn't care if the rootcert is missing, it just skips validating in that case. If you change it to 'verify-full', then presumably it would stop working.
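For example, a sketch of the changed parameters (the path is a placeholder for wherever you actually saved the bundle on the instance):
sslmode='verify-full', sslrootcert="/home/ec2-user/rds-combined-ca-bundle.pem"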

IBM Aspera get size of file before download

I am using Aspera Connect on Mac to download files from a server. It works fine in the terminal, but I was wondering if, before I download a file, I could read its size first and then decide whether I want to download it or not. I found the flag
'--precalculate-job-size'
but it only does that right before the download and there is no way to stop the download.
The current command i use is this:
/Applications/Aspera\ Connect.app/Contents/Resources/./ascp -QT -l 200M -P33001 -i "/Applications/Aspera Connect.app/Contents/Resources/asperaweb_id_dsa.openssh" emp_ext3#fasp.ebi.ac.uk:/{asp_path} {local_path}
The resources for the flags are here:
https://download.asperasoft.com/download/docs/ascp/2.7/html/index.html
To answer your question, without going too much into the details:
If you want to display the size of elements on an Aspera server to which you have access, you can use the command-line tool "Amelia" (the mlia command), see:
https://www.rubydoc.info/gems/asperalm
mlia server --url=ssh://fasp.ebi.ac.uk:33001 --username=emp_ext3 --ssh-keys=~/.aspera/mlia/aspera_bypass_dsa.pem br /10002/data/100_movie_gc.mrcs
There are plenty of options, like --format=csv --fields=size.
Note that this displays individual file sizes, but not recursive folder size.
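For example, to print only the size as CSV, the command above can be combined with those options (a sketch; exact option placement may vary between versions of the tool):
mlia server --url=ssh://fasp.ebi.ac.uk:33001 --username=emp_ext3 --ssh-keys=~/.aspera/mlia/aspera_bypass_dsa.pem --format=csv --fields=size br /10002/data/100_movie_gc.mrcs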
A few other things:
You are not exactly using "Connect", but rather the "ascp" command line. Connect refers to the browser extension and lightweight app, while ascp is the implementation of the Aspera FASP transfer protocol, found in basically all Aspera products.
The latest ascp documentation can be found here: https://www.ibm.com/support/knowledgecenter/SSL85S_3.9.6/hsts_admin_linux/dita/hsts_admin_linux_ascp_usage.html
Did you know you can also use the free client? https://downloads.asperasoft.com/en/downloads/2
It includes ascp as well, but also a graphical user interface.

PlayFramework 2.2.6: Advanced HTTP server configuration maxInitialLineLength

We are trying to send GET and POST requests with a length greater than 4096 bytes to our REST API implemented with Playframework 2.2.6.
After a lot of googling we tried nearly everything, and the solution seems to be passing the following two arguments when starting our server via play. We receive no error message about wrong parameters, but when we send a large request to the API we still receive the error
TooLongFrameException: An HTTP line is larger than 4096 Bytes
We are running the server with the following command:
<PathToPlay>\play-2.2.6\play.bat -org.jboss.netty.maxHeaderSize:102400 -org.jboss.netty.maxInitialLineLength:102400 run
First of all, the path you use to start your application seems off. When you create a new Play application, a play.bat or activator.bat file is automatically created in your project root folder, so there is no need to call a separate Play installation outside your project folder.
The parameters for setting the maximum initial line length and header size can be found in the Play documentation:
http.netty.maxInitialLineLength
- The maximum length for the initial line of an HTTP request, defaults to 4096
http.netty.maxHeaderSize
- The maximum size for the entire HTTP header, defaults to 8192
Development Mode
To start your application in development mode call
/path/to/project/play run -Dhttp.netty.maxInitialLineLength=102400 -Dhttp.netty.maxHeaderSize=102400
If you've used Activator to create your project replace play with activator.
Production mode
After you've published your application for production with play dist you can set the parameters by calling
/path/to/publishedApp/bin/<nameOfApp> -Dhttp.netty.maxInitialLineLength=102400 -Dhttp.netty.maxHeaderSize=102400

What is the best approach to make a large static binary available through an HTTP endpoint in Go on Google App Engine?

Due to the size of the file I am repeatedly hitting the deadline error (https://www.shiftedup.com/2015/03/12/deadline-errors-60-seconds-or-less-in-google-app-engine) and cannot host these 3 binary files (available on 3 endpoints) over a CDN.
App Engine has two limits: 60 seconds and 32MB max per request. If you need to serve large files, you need to use Google Cloud Storage, which supports objects of up to 5TB (as of June 2016). You can keep these files private and serve them directly from the bucket to your client using a signed URL.
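For example, a signed URL can be generated with gsutil (a sketch; the duration, key file, bucket and object names are placeholders):
gsutil signurl -d 1h /path/to/service-account-key.json gs://your-bucket/your-large-binary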