Near real-time streaming data from 100s of customers to Google Pub/Sub to GCS - google-cloud-storage

I am getting near-real-time data from 100s of customers. I need to store this data in Google Cloud Storage buckets created for each customer, i.e. /gcs/customer_id/yy/mm/day/hhhh/
My data is in Avro. I guess I can use the Pub/Sub to Avro Files on Cloud Storage template.
However, I'm not sure if Google Pub/Sub can accept data from multiple customers.
Appreciate any help here, thanks!

The template is quite simple: it takes all the data from Pub/Sub and stores it in Avro files on GCS.
However, it's a good starting point, and you can build on it to add a split per customer and the file path layout that you want.
You can find the template source, in Java, on GitHub.
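For reference, here is a minimal sketch of launching the stock template with gcloud; the project, topic, bucket, and region names are placeholders, and the parameter names should be double-checked against the template's documentation:

gcloud dataflow jobs run pubsub-to-avro-job \
    --gcs-location=gs://dataflow-templates/latest/Cloud_PubSub_to_Avro \
    --region=us-central1 \
    --parameters=inputTopic=projects/my-project/topics/customer-events,outputDirectory=gs://my-bucket/avro/,avroTempDirectory=gs://my-bucket/avro-temp/

As launched, everything lands under a single outputDirectory. Writing to /gcs/customer_id/yy/mm/day/hhhh/ means forking the template's Java code and deriving the output path from a customer ID carried in each message (for example as a Pub/Sub attribute).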

Related

CSV file storage vs Postgres database for multiple users

I have a Dash app that features live updates on CSV data uploaded to a private repository on GitHub at 5-minute intervals. Currently the Dash app reads the entire CSV file at each update and populates the graphs. The app is hosted on Heroku. I'm concerned about the scalability of the app, as multiple users using the app will send API requests for CSV data of perhaps 30-70 rows each, potentially requiring a lot of dynos. Would hosting the data on Postgres prove advantageous in any way? Currently, each new user has a separate folder in the GitHub repository, which makes API requests simple. Thank you. Note: I'm new to SQL, please be gentle.

How do I load a Google Cloud Storage Firebase export into BigQuery?

I have a simple, single collection in Firestore, a museum visitor name and date with some other fields.
I successfully ran a gcloud export:
gcloud beta firestore export gs://climatemuseumexhibition2019gov.appspot.com --collection-ids=visitors
and the collection is now sitting in a bucket in Cloud Storage.
I tried downloading the data, which appears to be in several chunks, but it's in .dms format and I have no idea what that is.
Also, I've successfully linked BigQuery to my Firestore project but don't see the collection, and I don't see the Cloud Storage object at all.
I gather the idea is to create a dataset from the Cloud Storage export, then create a table from the dataset. I'd appreciate specific details on how to do that.
I've read the manuals and they are opaque on this topic, so I'd appreciate some first hand experience. Thank you.
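A sketch of the load step being asked about, using the bq command-line tool and assuming the default export layout (the dataset name and the timestamped export folder are placeholders for values the export command printed; the chunk files with the .dms extension are internal to the export and are not meant to be read directly, BigQuery consumes the export through its export_metadata file):

bq mk mydataset
bq load --source_format=DATASTORE_BACKUP \
    mydataset.visitors \
    gs://climatemuseumexhibition2019gov.appspot.com/EXPORT_FOLDER/all_namespaces/kind_visitors/all_namespaces_kind_visitors.export_metadata

Linking BigQuery to the project does not import Firestore exports by itself; the table only appears under the dataset after a load (or external table definition) like the one above.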

Using Google Cloud Storage with Google Data Prep

I am using Google Cloud Storage to store CSV files. These CSV files get updated daily with new data in them. I'm hoping to use Google Cloud Dataprep to automate the process of cleaning these files and then combining them. Before I start to build this process, I am curious whether this is a good way to use the platform. The CSV files will be in the same format each time. Is there any cause for concern if the files get updated on a daily basis, or possible errors that could arise that I don't know about?
This is a great use case for Google Cloud Dataprep. You can parameterize your inputs. See https://cloud.google.com/dataprep/docs/html/Overview-of-Parameterization_118228665 and https://cloud.google.com/dataprep/docs/html/Create-Dataset-with-Parameters_118228628

Is there any way to call the Bing Ads API through a pipeline and load the data into BigQuery through Google Data Fusion?

I'm creating a pipeline in Google Data Fusion that allows me to export my Bing Ads data into BigQuery using my Bing Ads developer token. I couldn't find any data sources in Data Fusion that would let me do this. Is fetching data from API calls even supported in Google Data Fusion, and if it is, how can it be done?
HTTP-based sources for Cloud Data Fusion are currently in development and will be released by Q3. Could you elaborate on your use case a little more, so we can make sure that your requirements will be covered by those plugins? For example, are you looking to build a batch or real-time pipeline?
In the meantime, you have the following two, more immediate options/workarounds:
If you are OK with storing the data in a staging area in GCS before loading it into BigQuery, you can use the HTTPToHDFS plugin that is available in the Hub. Use a path that starts with gs://, e.g. gs://<bucket>/path/to/file (see the example load command after these options).
Alternatively, we also welcome contributions, so you can also build the plugin using the Cloud Data Fusion APIs. We are happy to guide you, and can point you to documentation and samples.
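As a rough sketch of the second half of that first workaround: once a report is staged in GCS, it can be loaded into BigQuery with the bq tool. The dataset, table, bucket, and file names below are placeholders, and the source format depends on what the Bing Ads API actually returns (CSV is assumed here):

bq load --source_format=CSV --autodetect \
    mydataset.bing_ads_raw \
    gs://my-staging-bucket/bing-ads/report.csv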

Uploading images to a PHP app on GCE and storing them on GCS

I have a PHP app running on several instances of Google Compute Engine (GCE). The app allows users to upload images of various sizes, resizes the images, and then stores the resized images (and their thumbnails) on the storage disk and their metadata in the database.
What I've been trying to find is a method for storing the images on Google Cloud Storage (GCS) through the PHP app running on GCE instances. A similar question was asked here but no clear answer was given there. Any hints or guidance on the best way to achieve this would be highly appreciated.
You have several options, all with pros and cons.
Your first decision is how users upload data to your service. You might choose to have customers upload their initial data to Google Cloud Storage, where your app would then fetch it and transform it, or you could choose to have them upload it directly to your service. Let's assume you choose the second option, and you want users to stream data directly to your service.
Your service then transforms the data into a different size. Great. You now have a new file. If this was video, you might care about streaming the data to Google Cloud Storage as you encode it, but for images, let's assume you want to process the whole thing locally and then store it in GCS afterwards.
Now we have to get a file into GCS. It's a PHP app, so, as you have identified, your three main options are:
Invoke the GCS JSON API through the Google API PHP client.
Invoke either the GCS XML or JSON API via custom code.
Use gsutil.
Using gsutil will be the easiest solution here. On GCE, it automatically picks up the appropriate credentials for your service account, and it has several useful performance optimizations and tuning options that a raw use of the API would not get without extra work (for example, multithreaded uploads). Plus, it's already installed on your GCE instances.
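As a minimal illustration, with a placeholder local path and bucket name, the copy itself is a one-liner that the PHP app could run (for example via exec()):

gsutil cp /var/www/uploads/resized/12345_thumb.jpg gs://my-images-bucket/12345/thumb.jpg

gsutil -m cp can be used the same way to copy a batch of files in parallel.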
The upside of the PHP API is that it's in-process and offers more fine-grained, programmatic control. As your logic gets more complicated, you may eventually prefer this approach. Getting it to perform as well as gsutil may take some extra work, though.
This choice is comparable to copying files via SCP with the "scp" command-line application or by using the libssh2 library.
tl;dr: Using gsutil is a good idea unless you need to handle interactions with GCS more directly.