I am studying for the Professional Data Engineer exam and I wonder what the "Google recommended best practice" is for hot data on Dataproc (given that cost is not a concern)?
If cost is a concern, I found a recommendation to keep all data in Cloud Storage because it is cheaper.
Can a mechanism be set up such that all data lives in Cloud Storage and recent data is cached on HDFS automatically? Something like what AWS does with FSx for Lustre and S3.
What to store in HDFS and what to store in GCS is a case-dependent question. Dataproc supports running Hadoop and Spark jobs against GCS via the Cloud Storage connector, which makes Cloud Storage HDFS-compatible without significant performance loss.
The Cloud Storage connector is installed by default on all Dataproc cluster nodes and is available in both Spark and PySpark environments.
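For illustration, here is a minimal PySpark sketch of reading and writing gs:// paths directly on a Dataproc cluster; the bucket, object names, and the filtered column are placeholders:

from pyspark.sql import SparkSession

# On Dataproc the Cloud Storage connector is preinstalled, so gs:// paths
# behave like any other Hadoop-compatible filesystem path.
spark = SparkSession.builder.appName("gcs-connector-demo").getOrCreate()

# Placeholder bucket and object names.
df = spark.read.csv("gs://my-bucket/raw/events.csv", header=True)
df.filter(df["status"] == "OK").write.mode("overwrite").parquet("gs://my-bucket/curated/events/")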
After researching a bit: the performance characteristics of HDFS and Cloud Storage (or any other blob store) are not completely equivalent. For instance, a "mv" operation in a blob store is emulated as copy + delete.
What the ASF can do is warn that our own BlobStore filesystems (currently s3:, s3n: and swift:) are not complete replacements for hdfs:, as operations such as rename() are only emulated through copying then deleting all operations, and so a directory rename is not atomic -a requirement of POSIX filesystems which some applications (MapReduce) currently depend on.
Source: https://cwiki.apache.org/confluence/display/HADOOP2/HCFS
How do I access or view the local storage for Druid? I would like to view the segments or copy them to a file. I am running the Druid operator on Kubernetes. I have tried exec commands on the Historical and MiddleManager pods, but I am unable to enter any of the Druid pods.
Have you tried looking at the location your deep storage configuration points to?
Deep storage is where segments are stored. It is a storage mechanism
that Apache Druid does not provide. This deep storage infrastructure
defines the level of durability of your data, as long as Druid
processes can see this storage infrastructure and get at the segments
stored on it, you will not lose data no matter how many Druid nodes
you lose. If segments disappear from this storage layer, then you will
lose whatever data those segments represented.
Source: Deep Storage on Druid documentation
For example, you have to know which directory druid.storage.storageDirectory points to.
Remember that the data is saved in segments, as described in Segments in the Apache Druid documentation.
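If you are using local deep storage, a small Python sketch like the one below can help locate the segment files; the properties file path is hypothetical and depends on your deployment:

import os
from pathlib import Path

# Hypothetical location of the common runtime properties; adjust for your deployment.
PROPERTIES_FILE = "/opt/druid/conf/druid/cluster/_common/common.runtime.properties"

def read_properties(path):
    # Parse a Java-style .properties file into a dict, ignoring comments.
    props = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

props = read_properties(PROPERTIES_FILE)
print("druid.storage.type =", props.get("druid.storage.type"))

storage_dir = props.get("druid.storage.storageDirectory")
if storage_dir and props.get("druid.storage.type") == "local":
    # List the segment files sitting under the configured deep storage directory.
    for root, _, files in os.walk(storage_dir):
        for name in files:
            print(os.path.join(root, name))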
Useful Documentation:
Ingestion troubleshooting FAQ
HDFS as Deep-Storage: Druid is not storing the historical data on hdfs
Druid Setup with HDFS
Change Local Storage to S3 as deepstorage
I have a work requirement to read binary data from sensors and produce Parquet output for analytics.
For storage I have chosen S3 and DynamoDB.
For the processing engine, I'm unsure how to choose between AWS EMR and AWS Glue.
The data processing code base will be maintained in Python with Spark.
Please post your suggestions on choosing between AWS EMR and AWS Glue.
Choosing between Glue and EMR depends on your use case.
EMR is a managed cluster of servers and costs less than Glue, but it also requires more maintenance and set-up overhead. On EMR you can run not only Spark but also other frameworks, such as Flink.
Glue is serverless Spark/Python and really easy to use. It does not run the latest Spark version and abstracts a lot of Spark away - in a good but also in a bad sense, in that you cannot easily set specific configurations.
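Whichever you pick, the core job can stay plain PySpark; a minimal sketch (the bucket names, paths, and the trivial stand-in "decoding" step are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sensor-to-parquet").getOrCreate()

# Placeholder S3 locations for the raw sensor dumps and the analytics output.
raw = spark.read.format("binaryFile").load("s3://my-raw-bucket/sensors/")

# Real sensor decoding would go here; as a stand-in we keep only the file path,
# modification time, and payload size in bytes.
decoded = raw.selectExpr("path", "modificationTime", "length(content) AS payload_bytes")

decoded.write.mode("overwrite").parquet("s3://my-analytics-bucket/sensors-parquet/")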
It's an opinion-based question, and there is now also AWS EMR Serverless.
AWS Glue is 1) more managed and thus more restricted, 2) has, imho, issues with crawling for schema changes to consider, 3) has its own interpretation of DataFrames, 4) offers less runtime configuration, and 5) fewer options for serverless scalability. There also seem to be a few bugs that keep popping up.
AWS EMR is 1) an AWS platform that is easy enough to configure, 2) with the AWS flavour of what they think the best way of running Spark is, 3) with some limitations in scaling resources back down after dynamic scale-out, 4) a platform that runs standard Spark, so there will be a bigger pool of people to hire, and 5) it allows bootstrapping of software not supplied by default, as well as selection of standard software such as, say, HBase.
So they are comparable to an extent, and divergent in other ways: AWS Glue is ETL/ELT, while AWS EMR is that plus more capabilities.
Is there a simple solution for point-in-time recovery of a Google Cloud Storage bucket (given that object versioning is enabled)? Something similar to S3 PIT Restore?
I have a webapp with data (Google Cloud SQL) and files (Google Cloud Storage), where I would like to be able to restore the state at a specific point in time. Cloud SQL offers this natively, and the recovery can even be done from the Cloud Console.
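For context, with versioning enabled a manual restore can be approximated by copying back, for each object, the newest generation created at or before the chosen timestamp. A rough sketch with the google-cloud-storage Python client follows; the bucket name and restore point are placeholders, and removing objects created after the cutoff is not handled:

from datetime import datetime, timezone
from google.cloud import storage

# Placeholder bucket name and restore point.
BUCKET = "my-versioned-bucket"
RESTORE_POINT = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

client = storage.Client()
bucket = client.bucket(BUCKET)

# With versioning enabled, list every generation and keep, per object name,
# the newest generation created at or before the restore point.
latest_before = {}
for blob in client.list_blobs(BUCKET, versions=True):
    if blob.time_created <= RESTORE_POINT:
        best = latest_before.get(blob.name)
        if best is None or blob.time_created > best.time_created:
            latest_before[blob.name] = blob

# Copy each chosen generation over the live object, effectively rolling it back.
for name, blob in latest_before.items():
    bucket.copy_blob(blob, bucket, name, source_generation=blob.generation)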
We're planning to migrate our software to run in Kubernetes with autoscaling. This is our current infrastructure:
PHP and Apache are running on Google Compute Engine n1-standard-4 (4 vCPUs, 15 GB memory)
MySQL is running in Google Cloud SQL
Data files (CSV, PDF) and the code are stored on a single SSD persistent disk
I found many posts that recommend storing the data files in Google Cloud Storage and using the API to fetch files and upload them to the bucket. We have very limited time, so I decided to use NFS to share the data files across the pods. The problem is that NFS is slow: it's around 100 MB/s when I copy a file with pv, while the result from iperf is 1.96 Gbits/sec. Do you know how to achieve the same result without implementing Cloud Storage, or how to increase the NFS speed?
Data files (CSV, PDF) and the code are stored on a single SSD persistent disk
There's nothing stopping you from volume mounting an SSD into the Pod so you can continue to use an SSD. I can only speak to AWS terminology, but some EC2 instances come with "local" SSD hardware, and thus you would only need to use a nodeSelector to ensure your Pods were scheduled onto machines that had said local storage available.
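A rough sketch of that nodeSelector idea using the Python Kubernetes client; the node label, image, and host path are hypothetical and depend on how your nodes expose the local SSD:

from kubernetes import client, config

config.load_kube_config()

# Illustrative only: pin a Pod to nodes labelled as having local SSDs and
# mount the SSD's host path into the container.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="php-apache-local-ssd"),
    spec=client.V1PodSpec(
        node_selector={"storage-tier": "local-ssd"},  # hypothetical node label
        containers=[
            client.V1Container(
                name="php-apache",
                image="php:8-apache",
                volume_mounts=[client.V1VolumeMount(name="data", mount_path="/var/www/data")],
            )
        ],
        volumes=[
            client.V1Volume(
                name="data",
                host_path=client.V1HostPathVolumeSource(path="/mnt/disks/ssd0"),  # hypothetical mount point
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)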
Where you're going to run into problems is if you are currently using just one php+apache instance and thus just one SSD, but now you want to scale the application up and it requires that all php+apache replicas have access to the same SSD. That's a classic distributed application architecture problem, and something Kubernetes itself can't fix for you.
If you're willing to expend the effort, you can also try any one of the other distributed filesystems (Ceph, GlusterFS, etc) and see if they perform better for your situation. Then again, "We have very limited time" I guess pretty much means that's off the table.
The above reference architecture indicates the existence of a Cloud Storage sink from Cloud Dataflow; however, the Beam API, which seems to be the current default Dataflow API, has no Cloud Storage I/O connector listed.
Can anyone clarify whether one exists, and if not, what the alternative is for bringing data from Dataflow to Cloud Storage?
Beam does support writing/reading from GCS. You simply use the TextIO classes.
https://beam.apache.org/documentation/sdks/javadoc/0.2.0-incubating/org/apache/beam/sdk/io/TextIO.html
To read a PCollection from one or more text files, use TextIO.Read. You can instantiate a transform using TextIO.Read.from(String) to specify the path of the file(s) to read from (e.g., a local filename or filename pattern if running locally, or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>").
You can use TextIO, AvroIO, or any other connector that reads from/writes to files to interact with GCS. Beam treats any file path that starts with "gs://" as a GCS path. Beam does this using the pluggable FileSystem [1] interface.
[1] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/storage/GcsFileSystem.java
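For reference, the links above are for the Java SDK; a minimal sketch of the same idea with the Beam Python SDK (bucket and path names are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder GCS paths.
INPUT = "gs://my-bucket/input/*.txt"
OUTPUT = "gs://my-bucket/output/results"

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Read from GCS" >> beam.io.ReadFromText(INPUT)
        | "Uppercase" >> beam.Map(str.upper)
        | "Write to GCS" >> beam.io.WriteToText(OUTPUT, file_name_suffix=".txt")
    )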