What do we enter in the parameter field when we use "most trusted source" as the survivorship function (i.e., using the T-Swoosh algorithm in tMatchGroup)? - Talend

I would like to create a master record from customer listings in multiple sources (i.e., Golden Customer Record / Master Data Management) using a Talend job. Research indicates that tMatchGroup is the best component for this, as it is capable of merging records based on survivorship rules.
My question is: if I would like to use the "most trusted source" survivorship function, how do I list the source ranking in the parameter field when I use the T-Swoosh algorithm? The documentation does not show how to do this and I can't find anything online.
This is the documentation I am referring to. Any advice would be much appreciated.

Related

How to quickly locate which sheets/dashboards contain a field?

I am creating a data dictionary and I am supposed to track the location of any used field in a workbook. For example (superstore sample data), I need to specify which sheets/dashboards have the [sub-category] field.
My dataset has hundreds of measures/dimensions/calculated fields, so it's incredibly time-consuming to click into every single sheet/dashboard just to see if a field exists in there. Is there a quicker way to do this?
One robust, but not free, approach is to use Tableau's Data Catalog, which is part of the Tableau Server Data Management Add-on.
Another option is to build your own cross-reference. You could start with Chris Gerrard's Ruby libraries, described in the article http://tableaufriction.blogspot.com/2018/09/documenting-dashboards-and-their.html
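If you're comfortable scripting, a lighter-weight variant of the cross-reference idea is to parse the workbook XML yourself: an unpackaged .twb file is plain XML, and each worksheet's definition lists the fields it references. Here is a minimal Python sketch of that idea (assuming a .twb rather than a packaged .twbx; the exact elements vary by Tableau version, and the file/field names are placeholders):

# Hypothetical sketch: scan an unpackaged Tableau workbook (.twb, which is XML)
# for worksheets whose definition mentions a given field name.
import xml.etree.ElementTree as ET

def sheets_using_field(twb_path, field_name):
    root = ET.parse(twb_path).getroot()
    hits = []
    for ws in root.iter("worksheet"):                 # every worksheet element in the workbook
        xml_text = ET.tostring(ws, encoding="unicode")
        if field_name.lower() in xml_text.lower():    # crude match on the field/caption name
            hits.append(ws.get("name", "<unnamed>"))
    return hits

if __name__ == "__main__":
    # "Superstore.twb" and "Sub-Category" are placeholders for your workbook and field.
    print(sheets_using_field("Superstore.twb", "Sub-Category"))

A .twbx is just a zip archive, so you could extract the .twb from it first; dashboards could then be cross-referenced by checking which of the matched worksheets each dashboard contains.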

Best practices for parameterizing load of multiple CSV files in Data Factory

I am experimenting with Azure Data Factory to replace some other data-load solutions we currently have, and I'm struggling to find the best way to organize and parameterize the pipelines to provide the scalability we need.
Our typical pattern is that we build an integration for a particular Platform. This "integration" is essentially the mapping and transformation of fields from their data files (CSVs) into our Stage1 SQL database; by the time the data lands there, the data types should be set properly and the indexes created.
Within each Platform, we have Customers. Each Customer has their own set of data files that get processed in that Customer context -- within the scope of a Platform, all Customer files follow the same schema (or close to it), but they all get sent to us separately. If you looked at our incoming file store, it might look like this (simplified; there are 20-30 source datasets per customer, depending on the platform):
Platform
  Customer A
    Employees.csv
    PayPeriods.csv
    etc.
  Customer B
    Employees.csv
    PayPeriods.csv
    etc.
Each customer lands in their own SQL schema. So after processing the above, I should have CustomerA.Employees and CustomerB.Employees tables. (This allows a little bit of schema drift between customers, which does happen on some platforms. We handle it later in our stage 2 ETL process.)
What I'm trying to figure out is:
What is the best way to set up ADF so I can effectively manage one set of mappings per platform and automatically accommodate any new customers we add to that platform, without having to change the pipeline/flow?
My current thinking is to have one pipeline per platform, and one dataflow per file per platform. The pipeline has a variable, "schemaname", which is set from the path of the file that triggered it (e.g. "CustomerA"). Then, depending on the file name, a branching conditional fires the right dataflow: if it's "employees.csv" it runs one dataflow, if it's "payperiods.csv" it runs a different one. They would all use the same generic target sink dataset, with the table name parameterized; those parameters are set in the pipeline from the schema variable and the file name from the conditional branch.
Are there any pitfalls to setting it up this way? Am I thinking about this correctly?
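To make that concrete, here is roughly the parameter-derivation logic I have in mind, written as a small Python sketch rather than actual ADF expression syntax (the folder layout matches the example above; the dataflow names are just illustrative):

# Illustrative only: the naming logic the pipeline would implement with ADF
# expressions and a conditional/switch activity, not actual ADF code.
from dataclasses import dataclass

@dataclass
class LoadParameters:
    schema_name: str   # target SQL schema, e.g. "CustomerA"
    table_name: str    # target table, e.g. "Employees"
    dataflow: str      # which per-file dataflow to invoke

def derive_parameters(blob_path):
    # Expected path shape: "Platform/CustomerA/Employees.csv"
    parts = blob_path.strip("/").split("/")
    customer, file_name = parts[-2], parts[-1]
    dataset = file_name.rsplit(".", 1)[0]              # "Employees"
    return LoadParameters(
        schema_name=customer,
        table_name=dataset,
        dataflow="df_" + dataset.lower(),              # e.g. "df_employees"
    )

print(derive_parameters("Platform/CustomerB/PayPeriods.csv"))
# LoadParameters(schema_name='CustomerB', table_name='PayPeriods', dataflow='df_payperiods')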
This sounds solid. Just be aware that if you define column-specific mappings with expressions that expect those columns to be present, you may get data flow execution failures when those columns are not present in a customer's source files.
The way to protect against that in ADF Data Flows is to use column patterns. This will allow you to define mappings that are generic and more flexible.

Tiki Wiki: how to permanently hide a tracker plugin from a user once saved

I'm trying to implement a read confirmation in a number of wiki pages, using trackers.
General Description:
Employees in our company are assigned to read a number of official procedures.
I'm trying to implement a process where:
Each employee is assigned procedures he needs to read according to his department (Group).
Each procedure is a wiki page
At the end of each procedure there will be a confirmation form in the following format.
Users who don't need to read this procedure won't see this form.
Users who have confirmed reading the document will see a message like:
You've confirmed reading this procedure.
Administrators will be able to monitor who read what procedure.
Questions:
How do I hide the tracker plugin from users who don't belong to the department (Group)?
How do I display a different message once the user has confirmed the read?
Thanks
There are different ways to achieve this, and it requires a bit of thinking (trade-offs from one method to another), but this is what I would do.
Have two groups (before approving / after approving).
Display the procedures using the ListExecute plugin, with the approval checkbox at the end triggering some actions (notification, group change, etc.).
Enclosing everything in a tracker and turning it into a multi-page form could also be the way to go.
Your case reminds me of another use case I worked on involving official procedures to read, a quick test (to check that the procedures were understood) and an approval mechanism. Look at https://doc.tiki.org/PluginExercise ;)
Good luck
Bernard
https://www.facebook.com/bsfez
Another, possibly simpler (? ;) way to do that would be to use just the GROUP and LIST plugins, maybe like this:
{GROUP(groups="This Department")}
{LIST()}
{filter type="trackeritem"}
{filter field="tracker_id" content="42"}
{filter field="tracker_field_procedurePage" content="{{page}}"}
{filter field="tracker_field_userLogin" content="{{user}}"}
{OUTPUT()}~tc~Item found, so already done~/tc~You already did this bit{OUTPUT}
{ALTERNATE()}~tc~Nothing found, show the form~/tc~{tracker trackerId=42 etc...}{ALTERNATE}
{LIST}
{ELSE}
You don't need to fill in the form
{GROUP}
This is totally untested I'm afraid, and I'm not 100% sure you can use a plugin in the ALTERNATE section, but give it a go? If it doesn't work, try using {display format="wiki plugin" etc...}, which might do the trick - good luck!

Billing by tag in Google Compute Engine

Google Compute Engine allows for a daily export of a project's itemized bill to a storage bucket (.csv or .json). In the daily file I can see X-number of seconds of N1-Highmem-8 VM usage. Is there a mechanism for further identifying costs, such as per tag or instance group, when a project has many of the same resource type deployed for different functional operations?
As an example, say 10 N1-Highmem-8 VMs are deployed to a region in a project. In the daily bill they just display as X seconds of N1-Highmem-8.
Functionally:
2 VMs might run a database 24x7
3 VMs might run a batch analytics operation averaging 2-5 hrs each night
5 VMs might perform a batch operation which runs in sporadic 10-minute intervals through the day
The final operation writes data to a specific GCS bucket; the other operations read/write to different buckets.
How might costs be broken out across these four operations each day?
The usage logs do not provide per-tag granularity at this time, and they can be a little tricky to work with, but here is what I recommend.
To break down the usage logs further and get better information out of them, I'd recommend working like this:
Your usage logs provide the following fields:
Report Date
MeasurementId
Quantity
Unit
Resource URI
ResourceId
Location
If you look at the MeasurementId, you can choose to filter by the type of image you want to verify. For example, VmimageN1Standard_1 is used to represent an n1-standard-1 machine type.
You can then use the MeasurementId in combination with the Resource URI to find out what your usage is on a more granular (per-instance) scale. For example, the Resource URI for my test machine would be:
https://www.googleapis.com/compute/v1/projects/MY_PROJECT/zones/ZONE/instances/boyan-test-instance
Note: I've replaced "MY_PROJECT" and "ZONE" here; those would be specific to your output, along with the name of the instance.
If you look at the end of the URI, you can clearly see which instance it is for. You could then use this to look for a specific instance you're checking.
If you are skilled with Excel or other spreadsheet/analysis software, you may be able to do even better, as this is just one idea of how you could use the logs. At that point it becomes somewhat a question of creativity; I am sure you can find good ways to work with the data you get from an export.
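If you'd rather script it than use a spreadsheet, something along these lines could roll a day's export up per instance by taking the last segment of the Resource URI. This is only a sketch; the column names are assumed to match the field list above, so check them against your actual CSV header:

# Rough sketch: aggregate exported GCE usage-log rows per instance.
# Column names ("MeasurementId", "Quantity", "Resource URI") are assumed to
# match the fields listed above; adjust them to your actual export.
import csv
from collections import defaultdict

def usage_per_instance(csv_path, measurement_filter=""):
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if measurement_filter and measurement_filter not in row["MeasurementId"]:
                continue
            # e.g. ".../zones/ZONE/instances/boyan-test-instance" -> "boyan-test-instance"
            instance = row["Resource URI"].rstrip("/").split("/")[-1]
            totals[instance] += float(row["Quantity"] or 0)
    return dict(totals)

# Example: usage per instance for one measurement type in one day's export
# (the file name and filter value are placeholders).
print(usage_per_instance("usage_gce_export.csv", measurement_filter="VmimageN1Standard_1"))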
9/2017 update.
It is now possible to add user-defined labels, then track usage and billing by these labels for Compute Engine and GCS.
Additionally, by enabling the billing export to BigQuery, it is then possible to create custom views or hit BigQuery from a tool more friendly to finance people, such as Google Docs, Data Studio, or anything that can connect to BigQuery. Here is a great example of labels across multiple projects used to split costs into something friendlier to organizations, in this case a Data Studio report.
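As a rough sketch of what that looks like once the BigQuery export is enabled, a query like the one below groups cost by a label key. The dataset/table name is a placeholder for your own billing export table, and "operation" is a hypothetical label key you would have applied to the VMs:

# Hedged sketch: query the BigQuery billing export and group cost by a label.
# The table name and the "operation" label key are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  label.value AS operation,
  SUM(cost)   AS total_cost
FROM `my_project.billing.gcp_billing_export_v1_XXXXXX`,
     UNNEST(labels) AS label
WHERE label.key = 'operation'
  AND DATE(usage_start_time) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY operation
ORDER BY total_cost DESC
"""

for row in client.query(sql).result():
    print(row.operation, row.total_cost)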

Exporting specified records from a portal

Hi, does anyone know if it's possible to export specified records from a portal as XML? Currently, when I filter the portal, it exports all records in the relationship and ignores the portal filter. Is it possible to specify which records to export from a portal without modifying the relationship?
Thanks for any help.
Is it possible to specify which records to export from a portal without modifying the relationship?
Not really. Well, at least in theory, you could go to the related records and perform a Constrain Found Set there to replicate the filter's action. But then you would have to implement the same logic twice, violating the DRY principle.
If you need the filtered results at the data layer (e.g. for export), then it's time to filter the relationship. Portal filtering is meant for display purposes only.
Note: this is assuming that you actually need this filtering for display purposes, too.