Is there a way to dump a TSV file from a Storage Bucket to Cloud SQL (MySQL) in GCP? - google-cloud-storage

Is there a way to dump a TSV file from a Storage Bucket to Cloud SQL (MySQL) in GCP? I have a large TSV file with 4M rows.
I couldn't convert it to CSV.

As of today, Cloud SQL only supports importing CSV and SQL dump files. Nonetheless, I suggest taking a look at this solution. I used Python so that the process can be automated in case you need to run it more than once. I tried to reproduce your issue and wrote a script that:
Downloads the TSV file from the specified Cloud Storage bucket.
Converts the TSV file to a CSV file.
Uploads the CSV file to the specified Cloud Storage bucket.
Imports the newly uploaded CSV file into Cloud SQL.
You can find the code, as well as the requirements for running this script, here. Note that you will need to replace the values enclosed in square brackets, such as [BUCKET_NAME], before running it. Also keep in mind that the script does not delete the downloaded TSV file or the generated CSV file, so you will need to delete them manually or modify the code to delete them automatically.
Finally, if you would like to read more about the APIs used in the script, the relevant documentation is here & here.
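The conversion step on its own needs nothing beyond the Python standard library. The following is only a minimal sketch of that step, not the linked script: the function name tsv_to_csv is made up for illustration, and it assumes a UTF-8 file whose fields contain no embedded tabs or newlines and whose quote characters are literal text.

```python
import csv


def tsv_to_csv(tsv_path, csv_path):
    """Stream a tab-separated file into a comma-separated file, row by row.

    Assumption: quote characters in the TSV are plain data, so the reader
    is told not to interpret them (QUOTE_NONE).
    """
    # newline="" lets the csv module handle line endings itself.
    with open(tsv_path, newline="", encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        # The default writer quotes any field that contains a comma or a quote.
        writer = csv.writer(dst)
        rows = 0
        for row in reader:
            writer.writerow(row)
            rows += 1
    return rows
```

Because rows are streamed one at a time, memory use should stay flat even for a file with millions of rows.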

Related

Confusion in the dbutils.fs.ls() command output. Please suggest

When I use the command below in Azure Databricks
display(dbutils.fs.ls("/mnt/MLRExtract/excel_v1.xlsx"))
the output is wasbs://paycnt#sdvstr01.blob.core.windows.net/mnt/MLRExtract/excel_v1.xlsx
and not the expected dbfs://mnt/MLRExtract/excel_v1.xlsx
Please suggest.
Mounting a storage account to the Databricks File System (DBFS) lets users access it repeatedly without supplying credentials each time. Any file or directory can be accessed from Databricks clusters through these mount points. The procedure you used mounts a blob storage container to DBFS.
So, you can access your blob storage container from DBFS using the mount point. The method dbutils.fs.ls(<mount_point>) lists all the files and directories available at that mount point. It is not necessary to provide the path of a file; instead simply use:
display(dbutils.fs.ls("/mnt/MLRExtract/"))
The above command returns all the files available at the mount point (which is your blob storage container). You can perform all the required operations and then write to DBFS, and the changes will be reflected in your blob storage container too.
Refer to the following link to learn more about the Databricks file system.
https://docs.databricks.com/data/databricks-file-system.html

JMeter CSV data split

I am load testing using JMeter containers inside a k8s cluster. Right now the jmx and the csv files are copied to all the containers. Is there a way to split the data file so that each JMeter instance in a container gets its own subset of the original file?
Are you looking for the split command, or something else? The number of lines in the file and the number of pods in the cluster can both be obtained using the wc command.
Also, there might be better solutions, such as the HTTP Simple Table Server or the Redis Data Set, so that the test data is stored in a "central" location and you don't have to bother with splitting the file and copying the parts to the slaves.
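If the split has to be scripted rather than done with the split command, a round-robin distribution of lines is one simple option. This is just a sketch under assumptions: the function name split_lines and the part_{}.csv naming are invented for illustration, and the data file is assumed to have no header row and no quoted fields that span several lines.

```python
import itertools


def split_lines(src_path, parts, out_pattern="part_{}.csv"):
    """Distribute the lines of src_path round-robin across `parts` files.

    Returns the list of output paths, one per JMeter instance.
    """
    out_paths = [out_pattern.format(i) for i in range(parts)]
    outs = [open(p, "w", encoding="utf-8", newline="") for p in out_paths]
    try:
        with open(src_path, encoding="utf-8", newline="") as src:
            # Line 1 goes to part 0, line 2 to part 1, and so on, wrapping around.
            for line, out in zip(src, itertools.cycle(outs)):
                out.write(line)
    finally:
        for out in outs:
            out.close()
    return out_paths
```

Here `parts` would be the number of JMeter pods, and each resulting file would then be copied to exactly one pod.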

Google Cloud SQL PostgreSQL replication?

I want to make sure that there's not a better (easier, more elegant) way of emulating what I think is typically referred to as "logical replication" ("logical decoding"?) within the PostgreSQL community.
I've got a Cloud SQL instance (PostgreSQL v9.6) that contains two databases, A and B. I want B to mirror A, as closely as possible, but don't need to do so in real time or anything near that. Cloud SQL does not offer the capability of logical replication where write-ahead logs are used to mirror a database (or subset thereof) to another database. So I've cobbled together the following:
A Cloud Scheduler job publishes a message to a topic in Google Cloud Platform (GCP) Pub/Sub.
A Cloud Function kicks off an export. The exported file is in the form of a pg_dump file.
The dump file is written to a named bucket in Google Cloud Storage (GCS).
Another Cloud Function (the import function) is triggered by the writing of this export file to GCS.
The import function makes an API call to delete database B (the pg_dump file created by the export API call does not contain initial DROP statements and there is no documented facility for adding them via the API).
It creates database B anew.
It makes an API call to import the pg_dump file.
It deletes the old pg_dump file.
That's five different objects across four GCP services, just to obtain already existing, native functionality in PostgreSQL.
Is there a better way to do this within Google Cloud SQL?

Pyspark dataframe write parquet without deleting /_temporary folder

df.write.mode("append").parquet(path)
I'm using this to write parquet files to an S3 location. It seems that in order to write the files, it also creates a /_temporary directory and deletes it after use, so I got access denied. The admin on our AWS account doesn't want to grant the code delete permission on that folder.
I proposed writing the files to another folder where delete permission can be granted and then copying the files over, but the admin still wants me to write files directly to the destination folder.
Is there a configuration I can set to tell PySpark not to delete the temporary directory?
I don't think there is such an option for the _temporary folder.
But if you're running your Spark job on an EMR cluster, you can write to the cluster's local HDFS first and then copy the data to S3 using the Hadoop FileUtil.copy function.
In PySpark, you can access this function via the JVM gateway like this:
sc._gateway.jvm.org.apache.hadoop.fs.FileUtil
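To make the gateway call more concrete, here is a rough sketch of how the copy might be wrapped. The helper name copy_hdfs_to_s3 and the example URIs are hypothetical, `sc` is assumed to be an active SparkContext, and the sketch relies on the six-argument FileUtil.copy overload (source file system, source path, destination file system, destination path, delete-source flag, configuration). It has not been run against a real cluster.

```python
def copy_hdfs_to_s3(sc, src_uri, dst_uri, delete_source=False):
    """Copy src_uri to dst_uri with Hadoop's FileUtil.copy via the JVM gateway.

    `sc` is assumed to be an active SparkContext; both URIs are illustrative.
    """
    jvm = sc._gateway.jvm
    conf = sc._jsc.hadoopConfiguration()
    Path = jvm.org.apache.hadoop.fs.Path
    FileUtil = jvm.org.apache.hadoop.fs.FileUtil

    src = Path(src_uri)
    dst = Path(dst_uri)
    # Resolve the FileSystem implementation behind each URI scheme.
    src_fs = src.getFileSystem(conf)
    dst_fs = dst.getFileSystem(conf)
    # FileUtil.copy(srcFS, src, dstFS, dst, deleteSource, conf)
    return FileUtil.copy(src_fs, src, dst_fs, dst, delete_source, conf)


# Hypothetical usage on the cluster:
#   df.write.mode("append").parquet("hdfs:///tmp/out")
#   copy_hdfs_to_s3(sc, "hdfs:///tmp/out", "s3://my-bucket/out")
```

With this approach the _temporary directory would be created and removed on HDFS, so the S3 side should only see new objects being written.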

How can I identify processed files in a Dataflow job

How can I identify processed files in a Dataflow job? I am using a wildcard to read files from Cloud Storage, but every time the job runs, it re-reads all the files.
This is a batch job, and the following is a sample of the TextIO read that I am using.
PCollection<String> filePColection = pipeline.apply("Read files from Cloud Storage ", TextIO.read().from("gs://bucketName/TrafficData*.txt"));
To see a list of files that match your wildcard you can use gsutil, the Cloud Storage command-line utility. You'd do the following:
gsutil ls gs://bucketName/TrafficData*.txt
Now, when it comes to running a batch job multiple times, your pipeline has no way of knowing which files it has already analyzed. To avoid re-analyzing files you could do either of the following:
Define a streaming job, and use TextIO's watchForNewFiles functionality. You would have to leave your job running for as long as you want to keep processing files.
Find a way to tell your pipeline which files have already been analyzed. For this, every time you run your pipeline you could generate a list of files to analyze, put it into a PCollection, read each with TextIO.readAll(), and store the list of analyzed files somewhere. Later, when you run your pipeline again, you can use this list as a blacklist of files that don't need to be processed again.
Let me know in the comments if you want to work out a solution around one of these two options.
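The bookkeeping behind the second option can be sketched independently of the pipeline itself. The snippet below is only an illustration in Python: the function names are invented, a local JSON file stands in for wherever the list of analyzed files would really be stored, and the file listing is assumed to come from Cloud Storage by some other means.

```python
import fnmatch
import json
import os


def load_processed(state_path):
    """Return the set of file names recorded by earlier runs (empty if none)."""
    if not os.path.exists(state_path):
        return set()
    with open(state_path, encoding="utf-8") as f:
        return set(json.load(f))


def select_unprocessed(all_files, pattern, processed):
    """Keep files that match the wildcard and are not in the processed set."""
    matched = [f for f in all_files if fnmatch.fnmatchcase(f, pattern)]
    return sorted(f for f in matched if f not in processed)


def save_processed(state_path, processed, newly_done):
    """Persist the union of previously and newly processed file names."""
    with open(state_path, "w", encoding="utf-8") as f:
        json.dump(sorted(set(processed) | set(newly_done)), f)
```

Each run would load the stored set, pass only the result of select_unprocessed to the pipeline, and save the updated set once the job has finished successfully.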