Sample collection is failing for dataprep - google-cloud-dataprep

Google Cloud Dataprep shows a summary of only the sampled data at the top; it is not an analysis of all the distinct data.
When I try to collect a sample based on unique values in a column, the underlying Dataflow job fails.
The error says that for a custom-mode network I need to specify the subnetwork where the job will run, but there is no control exposed to change where the sampling job runs.
Is there a workaround or a way to see the full data instead of a sample?

Change the VPC settings under your user profile; that is where you can specify the network and subnetwork used by the sampling Dataflow jobs.

Difference between DataFlow and Pipelines

I do not understand the difference between a data flow and a pipeline in Azure Data Factory.
I have read that Data Flow can transform data without writing any code.
But I have built a pipeline, and it seems to do exactly the same thing.
Thanks
A Pipeline is an orchestrator and does not transform data. It manages a series of one or more activities, such as Copy Data or Execute Stored Procedure. Data Flow is one of these activity types and is very different from a Pipeline.
Data Flow performs row and column level transformations, such as parsing values, calculations, adding/renaming/deleting columns, even adding or removing rows. At runtime a Data Flow is executed in a Spark environment, not the Data Factory execution runtime.
A Pipeline can run without a Data Flow, but a Data Flow cannot run without a Pipeline.
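To make "a Data Flow cannot run without a Pipeline" concrete, here is a rough illustration, written as a Python dict mirroring the pipeline JSON; the property names are indicative and may differ slightly between Data Factory API versions.
# Rough illustration only: a pipeline definition whose single activity
# executes a separately defined Data Flow. The pipeline orchestrates;
# the referenced Data Flow does the row/column transformations in Spark.
pipeline_definition = {
    "name": "MyPipeline",
    "properties": {
        "activities": [
            {
                "name": "RunMyDataFlow",
                "type": "ExecuteDataFlow",  # the Data Flow activity type
                "typeProperties": {
                    "dataFlow": {
                        "referenceName": "MyDataFlow",  # the Data Flow resource
                        "type": "DataFlowReference",
                    }
                },
            }
        ]
    },
}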
Firstly, a Data Flow activity has to be executed inside a pipeline, so I suspect you are comparing the Copy activity with the Data Flow activity, since both are used to move data from a source to a sink.
"I have read and see DataFlow can Transform Data without writing any line of code."
You can see this in the Data Flow overview: Data Flow allows data engineers to develop graphical data transformation logic without writing code. All transformation steps are built on visual interfaces.
"I have made a pipeline and this is exactly the same thing."
The Copy activity can be used for data transmission, but it has many limitations with column mapping. So if you just need simple, pure data transmission, the Copy activity is enough. To meet more specific needs, the Data Flow activity offers many built-in features, for example Derived Column, Aggregate and Sort.

Cloud SQL: export data to CSV periodically avoiding duplicates

I want to export the data from Cloud SQL (Postgres) to a CSV file periodically (once a day, for example), and rows that have already been exported must not be exported again in the next export task.
I'm currently using a POST request to perform the export task with Cloud Scheduler. The problem here (or at least as far as I know) is that it can't export and delete (or update the rows to mark them as exported) in a single HTTP export request.
Is there any possibility to delete (or update) the rows that have been exported, automatically, with some Cloud SQL parameter in the HTTP export request?
If not, I assume it should be done in a Cloud Function triggered by Pub/Sub (using Scheduler to publish once a day), but is there an optimal way to capture the IDs of the rows returned by the select statement (which will be used in the export) so they can be deleted (or updated) later?
You can export and delete (or update) at the same time using RETURNING.
\copy (DELETE FROM pgbench_accounts WHERE aid<1000 RETURNING *) to foo.txt
The problem would be in the face of crashes. How can you know that foo.txt has been written and flushed to disk before the DELETE is allowed to commit? Or the reverse: foo.txt is partially (or fully) written, but a crash prevents the DELETE from committing.
Can't you make the system idempotent, so that exporting the same row more than once doesn't create problems?
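For completeness, here is a minimal sketch of a crash-safer variant of the same idea, assuming a hypothetical exported_at column has been added to the table: rows are first marked inside a committed transaction and only then written to the file, so a re-run after a crash can find the already-marked batch instead of losing or duplicating rows.
# Sketch only: two-phase export with a hypothetical exported_at column.
import csv
import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres")  # adjust connection settings
with conn:  # commits on success, rolls back on error
    with conn.cursor() as cur:
        # Phase 1: claim a batch of unexported rows and remember what was claimed.
        cur.execute("""
            UPDATE pgbench_accounts
               SET exported_at = now()
             WHERE exported_at IS NULL
            RETURNING aid, bid, abalance
        """)
        rows = cur.fetchall()

# Phase 2: write the claimed rows to the file. If this step crashes, the rows
# are still marked, so they can be re-selected by exported_at and re-written.
with open("foo.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)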
You could use the following setup to achieve what you are looking for (a minimal sketch of such a function follows the links below):
1. Create a Cloud Function that extracts the information from the database and subscribes to a Pub/Sub topic.
2. Create a Pub/Sub topic to trigger that function.
3. Create a Cloud Scheduler job that publishes to that Pub/Sub topic.
4. Run the Cloud Scheduler job.
5. Then create a trigger that activates another Cloud Function to delete the required data from the database once the CSV has been created.
Here are some documents that could help you if you decide to follow this path.
Using Pub/Sub to trigger a Cloud Function: https://cloud.google.com/scheduler/docs/tut-pub-sub
Connecting to Cloud SQL from Cloud Functions: https://cloud.google.com/sql/docs/mysql/connect-functions
Cloud Storage tutorial: https://cloud.google.com/functions/docs/tutorials/storage
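A minimal sketch of the first function (the one triggered by Pub/Sub), assuming the export is started through the Cloud SQL Admin API; the project, instance, bucket and query below are placeholders to replace with your own.
# Sketch of a Pub/Sub-triggered Cloud Function that starts a Cloud SQL CSV export.
import google.auth
from google.auth.transport.requests import AuthorizedSession

PROJECT = "my-project"        # placeholder: your GCP project id
INSTANCE = "my-sql-instance"  # placeholder: your Cloud SQL instance name
BUCKET = "my-export-bucket"   # placeholder: a bucket the instance can write to

def export_to_csv(event, context):
    """Entry point for the Pub/Sub-triggered background function."""
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    session = AuthorizedSession(credentials)
    body = {
        "exportContext": {
            "fileType": "CSV",
            "uri": f"gs://{BUCKET}/export.csv",
            "databases": ["postgres"],
            "csvExportOptions": {
                "selectQuery": "SELECT * FROM public.mytable"  # placeholder query
            },
        }
    }
    # Cloud SQL Admin API export endpoint (v1; older setups may use sql/v1beta4).
    url = f"https://sqladmin.googleapis.com/v1/projects/{PROJECT}/instances/{INSTANCE}/export"
    response = session.post(url, json=body)
    response.raise_for_status()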
Another method, aside from @jjanes' answer, would be to partition your table by date. This would allow you to create an index on the date, making exporting or deleting a day's entries very easy. With this implementation you could also create a cron job that drops partitions older than X days.
The PostgreSQL documentation on table partitioning will walk you through setting up a range partition:
The table is partitioned into “ranges” defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. For example, one might partition by date ranges, or by ranges of identifiers for particular business objects.
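For illustration, a rough sketch (using psycopg2) of creating a table range-partitioned by insertion date, with daily partitions that a scheduled job could add and drop; the table and column names are placeholders.
# Sketch only: PostgreSQL 10+ declarative range partitioning by date.
import psycopg2

DDL = """
CREATE TABLE mytable (
    id             bigint,
    payload        text,
    insertion_date date NOT NULL
) PARTITION BY RANGE (insertion_date);

-- one partition per day; a cron job could create tomorrow's partition
-- and DROP partitions older than X days
CREATE TABLE mytable_2020_01_01 PARTITION OF mytable
    FOR VALUES FROM ('2020-01-01') TO ('2020-01-02');
"""

with psycopg2.connect("dbname=postgres user=postgres") as conn:  # adjust settings
    with conn.cursor() as cur:
        cur.execute(DDL)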
Thank you for all your answers. There are multiple ways of doing this, so I'm going to explain how I did it.
In the database I added a column that contains the date when each row was inserted.
I used a Cloud Scheduler job with the following body:
{
  "exportContext": {
    "fileType": "CSV",
    "csvExportOptions": {
      "selectQuery": "select \"column1\", \"column2\",... , \"column n\" from public.\"tablename\" where \"Insertion_Date\" = CURRENT_DATE - 1"
    },
    "uri": "gs://bucket/filename.csv",
    "databases": ["postgres"]
  }
}
This scheduler job is triggered once a day and exports only the previous day's data.
Also, note that in the query used in Cloud Scheduler you can choose which columns to export; this way you can avoid exporting the Insertion_Date column and use it only as an auxiliary.
Finally, the export triggered by Cloud Scheduler automatically creates the CSV file in the bucket.

Loading many tables in Cloud Data Fusion fails with DAG error

I have an MS SQL Server data source with around 1000 tables, which I need to put into BigQuery. I was hoping to use Data Fusion to load them all into staging tables in BigQuery and then perform transformations on them afterwards. However, as soon as I create a pipeline with two "islands" it gives a DAG error. Is that a feature or just something I'm doing wrong? I can't find anything in the documentation. My pipeline looks like this:
And the error I get when I try to deploy is: "Invalid DAG. There is an island made up of stages BigTest,BigQuery BigTest (no other stages connect to them)."
Each pipeline is a single DAG (directed acyclic graph), and every source and sink must be connected for the configuration to be valid. You can use a multi-table source plugin that can bring in multiple tables at once to a landing table in BigQuery.
You can use the multi-table source plugin and the BQ multi-table sink for your use case.

SonarQube DB lacking values

I connected my SonarQube server to my Postgres DB; however, when I view the "metrics" table, it lacks the actual values of the metrics.
Those are all the columns I get, which are not particularly helpful. How can I get the actual values of the metrics?
My end goal is to obtain metrics such as duplicated code, function size, complexity, etc. for my projects. I understand I could also use the REST API to do this; however, another application I am using will need a DB to extract data from.
As far as I know, connecting to a DB just lets SonarQube store its data; it is not meant for reading the data back.
You can check the stored data in SonarQube's GUI:
Click on the project
Click on Activity
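If you do go the REST route you mention, here is a small sketch of fetching measure values through the Web API; the server URL, project key, token and metric keys are placeholders to adapt.
# Sketch only: read project measures from the SonarQube Web API.
import requests

SONAR_URL = "http://localhost:9000"  # placeholder: your SonarQube server
PROJECT_KEY = "my_project"           # placeholder: your project key
TOKEN = "your-user-token"            # placeholder: a token generated in SonarQube

resp = requests.get(
    f"{SONAR_URL}/api/measures/component",
    params={
        "component": PROJECT_KEY,
        "metricKeys": "duplicated_lines_density,complexity,functions,ncloc",
    },
    auth=(TOKEN, ""),  # token as username, empty password
)
resp.raise_for_status()
for measure in resp.json()["component"]["measures"]:
    print(measure["metric"], measure["value"])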

How to log a talend job result?

We have many Talend jobs that transfer data from Oracle (tOracleInput) to Redshift (tRedshiftOutputBulkExec). I would like to store the result information in a DB table, for example:
job name, start time, running time, rows loaded, successful or failed
I know that if I turn on log4j, most of this information can be derived from the log. However, saving it in a DB table makes it easy to check and report on the results.
I'm most interested in the number of records loaded. I checked this link http://www.talendbyexample.com/talend-logs-and-errors-component-reference.html and the manual of tRedshiftOutputBulkExec; neither of them gives me that information.
Does Talend Administration Center provide such a function? What is the best way to implement it?
Thanks,
After looking at the URL you provided, tLogCatcher should provide you with what you need (minus the rows loaded, which you can get with a lookup).
I started with Talend Studio version 6.4.1. There you can configure "Stats & Logs" for a job. It can log to the console, to files, or to a database. When writing to a DB you set the JDBC parameters and the names of three tables:
Stats table: stores the start and end timestamps of the job
Logs table: stores error messages
Meter table: stores the row count for each monitored flow
They correspond to the components tStatCatcher, tLogCatcher and tFlowMeterCatcher, where you can find the needed table schemas.
To monitor a flow, select it, open the "Component" tab and check "Monitor this connection".
To see the logged values you can use the AMC (Activity Monitoring Console) in Studio or TAC.
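As an illustration of the kind of report asked for above, here is a rough sketch of a query against the Stats and Meter tables; the table and column names are assumptions based on a typical tStatCatcher/tFlowMeterCatcher schema and should be checked against the schemas shown on those components.
# Sketch only: report job name, end time, duration and rows loaded per run.
import psycopg2  # use the driver that matches your logging database

REPORT_SQL = """
SELECT s.job,
       s.moment     AS end_time,
       s.duration   AS duration_ms,
       SUM(m.count) AS rows_loaded
FROM   stats_table s                    -- the table names are set under "Stats & Logs"
LEFT JOIN meter_table m ON m.pid = s.pid
WHERE  s.message_type = 'end'
GROUP BY s.job, s.moment, s.duration
ORDER BY s.moment DESC;
"""

with psycopg2.connect("dbname=talend_logs user=talend") as conn:  # adjust settings
    with conn.cursor() as cur:
        cur.execute(REPORT_SQL)
        for job, end_time, duration_ms, rows_loaded in cur.fetchall():
            print(job, end_time, duration_ms, rows_loaded)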