Stackdriver gcloud log write throughput

I am looking into the gcloud logging command line tool. I started with a classic sample:
gcloud beta logging write --payload-type=struct my-test-log "{\"message\": \"My second entry\", \"weather\": \"aaaaa\"}"
It works fine, so I checked the throughput with the following code. It runs very slowly (about 2 records a second). Is this the best way to do it?
Here is my sample code:
tail -F -q -n0 /root/logs/general/*.log | while read -r line
do
echo "$line"
b=$(date)
gcloud beta logging write --payload-type=struct my-test-log "{\"message\": \"My second entry $b\", \"weather\": \"aaaaa\"}"
done

If you assume each command execution takes around 150ms at best, you can only write a handful of entries every second. You can try using the API directly to send the entries in batches. Unfortunately, the command line can currently only write one entry at a time. We will look into adding the capability to write multiple entries at a time.
If you want to stream a large number of messages quickly, you may want to look into Pub/Sub.
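For reference, a batched write through the underlying entries.write REST method might look roughly like the sketch below (PROJECT_ID and the example payloads are placeholders; the call writes all entries in one HTTP request using your current gcloud credentials):
# Rough sketch: write several entries in a single entries.write call.
curl -s -X POST "https://logging.googleapis.com/v2/entries:write" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "logName": "projects/PROJECT_ID/logs/my-test-log",
        "resource": {"type": "global"},
        "entries": [
          {"jsonPayload": {"message": "entry 1", "weather": "aaaaa"}},
          {"jsonPayload": {"message": "entry 2", "weather": "aaaaa"}}
        ]
      }'
Most of the speed-up comes from amortizing the process start-up and HTTP round trip over many entries.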

Related

How to reliably run ksql command in Kafka

Using the Confluent Kafka Cloud API endpoint, I send an HTTP request containing a CREATE OR REPLACE STREAM command that updates an already existing stream. Half of the time, however, ksqlDB returns a 502 error:
Could not write the statement 'CREATE OR REPLACE STREAM...
Caused by: Another command with the same id
It seems this is a well-known issue that happens when a ksqlDB command fails to commit. It has been "recommended" to send the same command again and again, "...as a subsequent retry of the command should work fine".
Another suggestion is "to clean up the command from the futures map on a commit failure".
Here is a GitHub issue link: https://github.com/confluentinc/ksql/issues/6340
The first recommendation, re-sending the same command multiple times in a row until it finally works, does not seem like a good solution. I wonder if the second recommendation, to clean up the command from the futures map, is a better choice?
How do we clean up the command from the futures map? Is there any other known way to make ksqlDB reliably execute the ksql command?
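For context, the blunt retry workaround from the first recommendation would look roughly like the sketch below (KSQLDB_ENDPOINT, API_KEY, API_SECRET and the statement are placeholders), which is exactly the pattern I would like to avoid:
# Rough sketch: resend the same statement to the /ksql endpoint until it succeeds.
for attempt in 1 2 3 4 5; do
  http_code=$(curl -s -o /dev/null -w "%{http_code}" \
    -u "$API_KEY:$API_SECRET" \
    -H "Content-Type: application/vnd.ksql.v1+json" \
    -d '{"ksql": "CREATE OR REPLACE STREAM my_stream AS SELECT * FROM source_stream EMIT CHANGES;", "streamsProperties": {}}' \
    "$KSQLDB_ENDPOINT/ksql")
  [ "$http_code" = "200" ] && break
  echo "Attempt $attempt got HTTP $http_code, retrying..." >&2
  sleep 2
done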

Set iteration count

I'm building a Locust script to be integrated into our CI/CD pipeline as a synthetic monitoring solution. It will run a single iteration every 15 minutes. If the application fails, alerts will be sent to the appropriate personnel.
Currently, I don't see an iteration-count command line option in the Locust help. I do see a --run-time option, but that specifies how long to run rather than how many times to run.
If you add locust-plugins, there is now a way to do this using the command line parameter -i. See https://github.com/SvenskaSpel/locust-plugins#command-line-options
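For example, with locust-plugins installed and imported in your locustfile, a single-iteration headless run might look like this (the file name is a placeholder):
locust -f locustfile.py --headless -u 1 -i 1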

How can I identify processed files in a Dataflow job

How can I identify processed files in a Dataflow job? I am using a wildcard to read files from Cloud Storage, but every time the job runs, it re-reads all the files.
This is a batch job, and the following is the sample TextIO read that I am using.
PCollection<String> filePColection = pipeline.apply("Read files from Cloud Storage ", TextIO.read().from("gs://bucketName/TrafficData*.txt"));
To see a list of files that match your wildcard you can use gsutil, which is the Cloud Storage command line utility. You'd do the following:
gsutil ls gs://bucketName/TrafficData*.txt
Now, when it comes to running a batch job multiple times, your pipeline has no way of knowing which files it has already analyzed. To analyze only new files, you could do either of the following:
Define a streaming job and use TextIO's watchForNewFiles functionality. You would have to leave the job running for as long as you want to keep processing files.
Find a way to track which files have already been analyzed. For this, every time you run your pipeline you could generate a list of files to analyze, put it into a PCollection, read each one with TextIO.readAll(), and store the list of analyzed files somewhere. Later, when you run your pipeline again, you can use this list as a blacklist of files that you don't need to process again (a rough sketch of this follows below).
Let me know in the comments if you want to work out a solution around one of these two options.
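As a rough sketch of the second option, the list of files still to be processed could be computed before each run and handed to the pipeline (bucket, paths and file names below are placeholders):
# Rough sketch: everything matching the wildcard, minus what earlier runs recorded as done.
gsutil ls gs://bucketName/TrafficData*.txt | sort > all_files.txt
# processed_files.txt is the sorted list of already-analyzed files stored by previous runs (e.g. in GCS).
comm -23 all_files.txt processed_files.txt > files_to_process.txt
# Feed files_to_process.txt to the pipeline (e.g. as a PCollection read with TextIO.readAll()),
# and append its contents to processed_files.txt once the job succeeds.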

How can one view the partial output of a job in PBS that has exceeded its walltime?

I'm new to using cluster computers to run experiments. I have a script running in python that should be regularly printing out information, but I find that when my job exceeds its walltime, I get no output at all except the notification that the job has been killed.
I've tried regularly flushing the buffer to no avail, and was wondering if there was something more basic that I'm missing.
Thanks!
I'm guessing you are having issues with a job cleanup script in the epilogue. You may want to ask the admins about it. You may also want to try a different approach.
If you were to redirect your output to a file in a shared filesystem you should be able to avoid data loss. This assumes you have a shared filesystem to work with and you aren't required to stage in and stage out all of your data.
If you reuse your submission script you can avoid clobbering the output of other jobs by including the $PBS_JOBID environment variable in the output filename.
script.py > $PBS_JOBID.out
I'm on mobile, so check the qsub man page for a list of job environment variables.
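For example, a minimal submission script along those lines might look like the following (the job name, walltime and script name are placeholders):
#!/bin/bash
#PBS -N my_experiment
#PBS -l walltime=01:00:00
# Write output to the shared filesystem as the job runs, keyed by job id so runs don't clobber each other.
cd $PBS_O_WORKDIR
python script.py > $PBS_JOBID.out 2>&1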

Intermittent file not found using Google Cloud Storage from Dataproc - flushing writes?

I have a series of dataproc jobs that run to import some data received each morning. The process creates a cluster, runs four jobs in sequence, then shuts down the cluster. The input file is read from Google Cloud Storage, and the intermediate results are also saved in Avro form in GCS with the final output going to Cloud SQL.
Fairly often the jobs fail trying to read the Avro written by the previous job. It appears that GCS hasn't "caught up" and the results from the previous job haven't been fully written. I was getting failures trying to read files that appeared to be from the previous day's run, and partway through, those files would disappear and be replaced by the new ones. I have changed the script that runs the jobs to clear the work area before starting them, but I still have problems where a job sometimes starts reading before all the parts have been fully written.
I could change the code to simply store the intermediate files on the cluster, though I like having them available outside the cluster for diagnosing other problems. I could also write to both locations, with the cluster copy used for the work and the GCS copy for diagnostics.
But assuming this is some kind of sync issue, is there a way to force GCS to flush writes / be fully synced between jobs? Or is there some check I can do to make sure everything has been written before starting the next job in my chain?
EDIT: To answer the comment below, the sequence of jobs all run on the same cluster. The cluster is started, each job run in turn on that cluster, and then the cluster is shut down.
For now, I have worked around this by having the jobs write to HDFS on the cluster in addition to GCS, and having the subsequent jobs read from the cluster. The GCS output is now strictly for diagnostics in case of a problem. But even though my immediate problem is (I believe) fixed, I would still like to know what is happening and why GCS seems out of sync for a bit.
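One check I could imagine adding between jobs is to have the driver script verify that the expected output exists and that no temporary files remain before launching the next job. The path and patterns below are placeholders and would need to match what the jobs actually write:
# Rough sketch: wait until the previous job's output looks complete before starting the next job.
until gsutil ls gs://my-bucket/intermediate/part-*.avro > /dev/null 2>&1 \
      && ! gsutil ls gs://my-bucket/intermediate/ | grep -q '_temporary'; do
  echo "Waiting for previous job's output to finish writing..."
  sleep 10
done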