Openrefine: cross cluster two dataset

Openrefine: cross cluster two dataset - cluster-analysis

I've got two datasets with titles and other informations, but in dataset A I have titles, in dataset B I have titles and URL.
I have to put the URL in dataset A from dataset B. Some titles are the same in A and B, some others are not, some others are slightly different (and here comes the problem).
So I need to merge and cluster at the same time those who are similar. I know that I can reconcile with DBpedia, but what I need is to "reconcile" between the two dataset.
Is it possible in some way?
Thank you.

You can use reconcile-csv application (it's not plugin for OpenRefine, but standalone program that runs local reconciliation API server).
Export dataset B as csv with first row as column names, then start reconcile-csv, using URL as id column and name as search column:
java -Xmx2g -jar reconcile-csv-0.1.2.jar <CSV-File> <Search Column> <ID Column>
Then open dataset A and add http://localhost:8000/reconcile as reconciliation service. After reconciliation, cell.recon.match.id for each reconciled cell will contain URL.

Related

How to filter prometheus series based on the results of another query in grafana dashboard?

I am using Grafana 9.3.1 for monitoring of our system. Among other things, I am trying to monitor the remaining FUP of a phone number for each unit we operate.
Basically, we intend to use two data sources.
Database mapping of the unit ID to its phone number (e.g. "unit_id=123, phone_number="00 123456789")
Prometheus time series remaining_fup{phone_number="00 123456789"}. However, remaining_fup is a 3rd party data and does not include unit_id.
In my unit-detail dashboard I have unit_id variable which indicates which unit FUP should be displayed (among other things depending on unit_id)
My original approach was this:
Create a mixed datasource dashboard
Add database datasource as data A. SELECT phone_number FROM units WHERE unit_id='$unit_id'
Add prometheus datasource remaining_fup and filter it based on A.phone_number: remaining_fup{phone_number="${A.phone_number}"}
Unfortunatelly such use of A isn't supported. I used to hope for applying some transformation like Merge or Join by field and then Filter but with no success. After a lot of googling and trying I feel hopeless.
Could you help please? Is such filter even possible? Thanks!
TL;DR: In grafana dashboard I want to query one datasource in order to obtain a value which I subsequently want to use in another datasource query.

1.) Create variable - name phone_number, type: Query and query your database datasource SELECT phone_number FROM units WHERE unit_id='$unit_id'. You can hide this variable if you don't want it to be visible for the dashboard users.
2.) Variable phone_number may have multiple values, so use advance variable formatting to create valid regex query syntax for your prometheus datasource, e.g.
remaining_fup{phone_number=~"${phone_number:pipe}"}
Of course this queries are just examples and they may need some (syntax) tweaking for the use case. Main idea: don't use 2 queries, but one variable and one query (where you use that variable).

Use SQL Select directly in CR Function

There's a working report but I wan't to change the visibility of certain rows based on a completely new DB select that should be executed when the report is created.
It would be ideal, if I could load the values of said select in an array or a list and than simply trigger certain row's visibility by comparing e.g. the Row Id with the values in the array.
Im used to solve a problem like this by creating a View that delivers all the essential information in each row and is used as the main data source but I was wondering if there's an elegant way within crystal reports to solve such a task.

I can think of three ways to include control data like this into the report:
One row of config data: If you can arrange it that your config data query returns one row of data, it can be just added to the data sources of the main report, without any links to tables and views already there.
Config resultset for matching: If you have to match main data results to config values row by row, e.g. by the Row Id you mentioned, add this config query to your report and link it to the main data source accordingly. (You are probably already doing this within your pre-created view on database side.)
Query config by subreport: Most flexible but also time consuming option is to add a subreport in report header, add the config data query and arrange config results into (shared) variables as needed in the subreport. Shared variable values can be used in main report then to control section visibility.

Yes, one of the 3rd-party Crystal Reports UFLs (User Function Libraries) listed here provides such a function.

Case-insensitive column names breaks the Data Preview mode in Data Flow of Data Factory

I have a csv file in my ADLS:
a,A,b
1,1,1
2,2,2
3,3,3
When I load this data into a delimited text Dataset in ADF with first row as header the data preview seems correct (see picture below). The schema has the names a, A and b for columns.
However, now I want to use this dataset in Mapping Data Flow and here does the Data Preview mode break. The second column name (A) is seen as duplicate and no preview can be loaded.
All other functionality in Data Flow keeps on working fine, it is only the Data Preview tab that gives an error. All consequent transformation nodes also gives this error in the Data Preview.
Moreover, if the data contains two "exact" same column names (e.g. a, a, b), then the Dataset recognizes the columns as duplicates and puts a "1" and "2" after each name. It is only when they are case-sensitive unequal and case-insensitive equal that the Dataset doesn't get an error and Data Flow does.
Is this a known error? Is it possible to change a specific column name in the dataset before loading into Data Flow? Or is there just something I'am missing?

I testes it and get the error in source Data Preview:
I ask Azure support for help and they are testing now. Please wait my update.
Update:
I sent Azure Support the test.csv file. They tested and replied me. If you insist to use " first row as header", Data Factory can not solve the error. The solution is that re-edit the csv file. Even in Azure SQL database, it doesn't support we create a table with same column name. Column names are case-insensitive.
For example, this code is not supported:
Here's the full email message:
Hi Leon,
Good morning! Thanks for your information.
I have tested the sample file you share with me and reproduce the issue. The data preview is alright by default when I connect to your sample file.
But I noticed when we do the trouble shooting session – a, A, b are column name, so you have checked the first row as header your source connection. Please confirm it’s alright and you want to use a, A, b as column headers. If so, it should be a error because there’s no text- transform for “A” in schema.
Hope you can understand the column name doesn’t influence the data transform and it’s okay to change it to make sure no errors block the data flow.
There’re two tips for you to remove block, one is change the column name from your source csv directly or you can click the import schema button ( in the blow screen shot) in schema tab, and you can choose a sample file to redefine the schema which also allows you to change column name.
Hope this helps.

Header Row Insert

I am trying to insert data to big query using google cloud dataprep, I did create recipe and add first row as header row, but when I am trying to run on multiple files it insert the header row to my big query table also.
Anybody facing this problem ?

Welcome to StackOverflow, Andy!
I think I'm correctly understanding your problem, but I want to make sure since I'm making some assumptions:
You have multiple files imported in Dataprep
You created a recipe for the first file and convert row 1 to be a header
You apply a UNION step to merge the additional files
Your output contains the header rows for the additional files
If that's correct, the issue is that the header rows in the other files aren't being removed simply because Dataprep doesn't know what they are. In most cases, Dataprep will detect the file structure and you won't have to manually specify the header row. When that fails, however, UNION steps get a little funny like this—but you can definitely fix it in Dataprep.
Workarounds:
Apply a Recipe to Each Input File
Simply add a recipe to each file that converts the first row to a header—then instead of selecting your original file in the main recipe's UNION, select the other recipes (Dataprep will run them before merging the data).
While this takes some extra effort, it's doable for a small number of files. The advantage here is that you don't have to worry about whether your data may contain the header value—but I'd recommend using the other option if you're able to.
Use a Custom Filter Formula to Delete All Header Rows
The other option is a bit more dependent on your data, but lets you do everything in the main recipe. For example, after setting headers from the first file and applying your UNION you would add a "Filter rows using custom formula" step (or clicking Filter Rows > On column values > Custom Filter...), then match using a column that wouldn't contain the header string (e.g. CustomerID == "CustomerID")—integer columns work great since you don't have to worry if the value could contain the header string. The Resulting wrangle script should look something like this:
header sourcerownumber: 1
[union step goes here]
filter type: custom rowType: single row: CustomerID == 'CustomerID' action: Delete
Note: You may be tempted to do this by using $sourcerownumber, but that doesn't exist due to the union. I'm hoping that they'll eventually support it for this use case though.
These aren't the only ways you could eliminate the headers, but should provide two easy options for you.
As a pro-tip, you can copy a line of the wrangle script above and paste it after clicking "New Step" in your recipe and it'll set up the filter the same way that I did so you don't have to start from scratch. Just change the column name/value and you should be good to go.
Again, welcome to the site—and if any of the assumptions above are incorrect, update your original question with the additional details and let me know in a comment and I'll be happy help you out further.

Tableau MarkLogic Data Modelling

I am using Tableau with MarkLogic. I have the following XML Structure
<CustomerInformation CustomerId="1">
<CustomerBasicInformation>
<CustomerTitle></CustomerTitle>
<CustomerFirstName></CustomerFirstName>
<CustomerMiddleName></CustomerMiddleName>
<CustomerLastName></CustomerLastName>
</CustomerBasicInformation>
<CustomerEmplyomentDetails>
<CustomerEmployer>
<EmployerName IsCurrentEmployer=""></EmployerName>
<CustomerDesignation></CustomerDesignation>
<EmployerLocation></EmployerLocation>
<CustomerTenure></CustomerTenure>
</CustomerEmployer>
<CustomerEmplyomentDetails>
<PolcyDetails>
<Policy PolicyId="">
<PolicyName></PolicyName>
<PolicyType></PolicyType>
<PolicyCategory></PolicyCategory>
<QuoteNumber></QuoteNumber>
<PolicyClaimDetails>
<PolicyClaim ClaimId="">
<PolicyClaimedOn></PolicyClaimedOn>
<PolicyClaimType></PolicyClaimType>
<PolicyClaimantName></PolicyClaimantName>
</PolicyClaim>
</PolicyClaimDetails>
<PolicyComplaintDetails>
<PolicyComplaint ComplaintId="">
<PolicyComplaintStatus></PolicyComplaintStatus>
<PolicyComplaintOn></PolicyComplaintOn>
</PolicyComplaint>
</PolicyComplaintDetails>
<BillingDetails>
<Billing BillingId="">
<BillingAmount></BillingAmount>
<BillingMode></BillingMode>
</Billing>
</BillingDetails>
</Policy>
<Policy PolicyId="">
<PolicyName></PolicyName>
<PolicyType></PolicyType>
<PolicyCategory></PolicyCategory>
<QuoteNumber></QuoteNumber>
<PolicyClaimDetails>
<PolicyClaim ClaimId="">
<PolicyClaimedOn></PolicyClaimedOn>
<PolicyClaimType></PolicyClaimType>
<PolicyClaimantName></PolicyClaimantName>
</PolicyClaim>
</PolicyClaimDetails>
<PolicyComplaintDetails>
<PolicyComplaint ComplaintId="">
<PolicyComplaintStatus></PolicyComplaintStatus>
<PolicyComplaintOn></PolicyComplaintOn>
</PolicyComplaint>
</PolicyComplaintDetails>
<BillingDetails>
<Billing BillingId="">
<BillingAmount></BillingAmount>
<BillingMode></BillingMode>
</Billing>
</BillingDetails>
</Policy>
</PolcyDetails>
</CustomerInformation>
I have created a view on above structure.
Initially I have created a single view for all elements, but on Tableau I got duplicate values as well as Cartesian join result.
So to tackle this, I used approach of fragment root.
Since there can be multiple PolicyDetails for single customer. I have created fragment root on Policy.
Similarly Claims, Complaints, Billing, Quote can be multiple for single policy, I have created fragment root on each one of them.
Now after doing this it resolves the duplicate issue as well as Cartesian join result set. It gives unique set of record for each entities (CustomerInfo, Policy, Claims, Complaints, Quote, Employer, Billing).
However I am not able to relate this entities with each other (as in foreign-primary key).
I have created the following view with element scope and all. I am pasting only Customer and Policy details, if this resolves other entities can be similarly managed
view:create(
"InsurancePOC",
"CustomerBasicInfo",
view:element-view-scope(xs:QName("CustomerInformation")),
(
view:column("CustomerId", cts:element-attribute-reference(xs:QName("CustomerInformation"), xs:QName("CustomerId"))),
view:column("PolicyId", cts:element-attribute-reference(xs:QName("Policy"), xs:QName("PolicyId"))),
view:column("QuoteNumber", cts:element-attribute-reference(xs:QName("Quote"), xs:QName("QuoteNumber"))),
view:column("ComplaintId", cts:element-attribute-reference(xs:QName("PolicyComplaint"), xs:QName("ComplaintId"))),
view:column("BillingId", cts:element-attribute-reference(xs:QName("Billing"), xs:QName("BillingId"))),:)
view:column("CustomerFirstName", cts:element-reference(xs:QName("CustomerFirstName"))),
view:column("CustomerLastName", cts:element-reference(xs:QName("CustomerLastName")))
),
(),
()
),
view:create(
"InsurancePOC",
"PolcyInfo",
view:element-view-scope(xs:QName("Policy")),
(
view:column("PolicyId", cts:element-attribute-reference(xs:QName("Policy"), xs:QName("PolicyId"))),
view:column("PolicyName", cts:element-reference(xs:QName("PolicyName"))),
view:column("PolicyType", cts:element-reference(xs:QName("PolicyType")))
),
(),
()
)
All pre-requisites like element-range index and all is been done.
I am trying to relate these entities using view:column("PolicyId", cts:element-attribute-reference(xs:QName("Policy"), xs:QName("PolicyId"))) in CustomerBasicInfo view.
If I do so it shows zero results in Tableau or Query console.
If I remove it, gives unique record but without any relationship with each other.
All I want is to achieve relationship between Policy-Customer
Kindly go through the code snippet, if more clarification required please let me know

The getting of cartesian join results is a known issue with the SQL views driven from Range indexes in MarkLogic, particularly with aggregate docs like above.
The simplest way to solve it for SQL views would be to split your docs into separate Policies, with embedded copies of the customer into. That could mean a fair amount of data duplication if customers often have multiple policies.
You could also consider taking these docs apart, and storing policies and customer details separately, with id refs from policy to customer, so that you can join them together afterwards, in Tableau, or SQL.
MarkLogic 9 comes with a new feature though, that would prevent the need for all this. It is called Template Driven Extraction. It also provides SQL views on data, but works in a different way. It is driven with a match pattern (called the context) that controls the rows in the view. You would use Policy as context in this case. From there you would use relative paths to go up the tree to customer details, and down to get policy details.
TDE templates are installed using tde:template-insert. The documentation of that function shows a simple example of such a TDE:
http://docs.marklogic.com/tde:template-insert
You can also play around with tde:node-data-extract first, to get the hang of it.
HTH!

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse