Google refine cross-reference between row and column - data-cleaning

I'm not sure if this can be achieved in Google Refine at all. But basically, I have data like this.
The first table lists all the users. The second table shows each user's friends. However, not every id in the "friends" column of the second table exists in the first table, and I want to get rid of those. So, how can I look up each id in the friends column of the second table and remove the ids that don't exist in Table 1?

Put the two tables in different projects (we'll call them Table1 and Table2).
In Table2, on the friends column:
use "split multi-valued cells" to get each value on a separate row
convert the visitors column to numbers (or conversely user_id in Table1 to string)
use "add a new column based on this column" with the expression cross(cell,'Table1','user_id').length()
This will return 0 if there's no match, 1 if there's a match or N>1 if there are duplicates in Table1
If you want the data back in the original format, set up a facet to filter on the validity column, blank out all the bad values and then use "join multi-valued cells" to reverse the split operation you did up front.
I fixed some caching bugs with cross() for OpenRefine 2.6, so if the cross doesn't work, try stopping and restarting the Refine server.
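
For readers who want to check the logic outside Refine, here is a minimal Python sketch of the same split, cross-check, filter and join steps. The sample rows, the comma separator and the function name keep_known_friends are assumptions for illustration, since the original data is not shown.

```python
# Hypothetical sample data standing in for the two Refine projects.
table1_user_ids = ["1", "2", "3", "5"]  # the user_id column of Table1
table2 = [
    {"user_id": "1", "friends": "2,3,4"},
    {"user_id": "2", "friends": "1,9"},
]


def keep_known_friends(rows, known_ids, sep=","):
    """Drop every friend id that does not appear in known_ids."""
    # Compare everything as strings, mirroring the type-conversion step.
    known = {str(i).strip() for i in known_ids}
    cleaned = []
    for row in rows:
        # Equivalent of "split multi-valued cells".
        friends = [f.strip() for f in row["friends"].split(sep)]
        # Equivalent of keeping rows where cross(...).length() > 0.
        valid = [f for f in friends if f in known]
        # Equivalent of "join multi-valued cells".
        cleaned.append({**row, "friends": sep.join(valid)})
    return cleaned


cleaned_table2 = keep_known_friends(table2, table1_user_ids)
```

Ids 4 and 9 are not in the hypothetical Table1, so they are the ones removed from the friends lists.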

Related

How do I get a VIEW value into a table?

I'm working on a personal project using PostgreSQL and I'm stuck trying to figure out what the best way to approach this problem:
So in my project there are two tables: USERS and MATCHES. MATCHES holds match results, one row per match, with the columns ID, player1, player2, player1result, player2result, and winner. The USERS table has ID, name, MatchesWon, MatchesTied, MatchesLost, GoalsScored, GoalsAgainst, and GoalDifference. What I would like to do is take these match results, analyze them and send them to the USERS table.
So for example, if in 5 matches, the winner was "Tom", the column MatchesWon for "Tom" should be 5. Now I can do this using a view, and I have, but there's a couple of things that I do not yet understand.
Can I use a view as a value? For example, create a table and determine that the MatchesWon value will use a view that counts all matches where winner equals that row's "name" value?
Whether or not that is possible, is there a better way to do this?
Thank you for your help!

how to paginate ordering by non-distinct / non-unique values in PostgreSQL?

How can I properly page by ordering on a column that could possibly have repeated values? I have a table called posts, which has a column that holds the number of likes of a certain post, called num_likes, and I want to order by num_likes DESC. But the image below shows a problem that I run into - the new row inserted between the two pages causes repeated data to be fetched.
This link here explains the problem and gives the solution of keyset pagination, but from what I've seen, that only works if the column that the rows are being sorted on is distinct / unique. How would I do this if that is not the case?
You can easily make the sort key unique by adding the primary key to it.
You don't have to display the primary key to the user, just use it internally to tell “equal” rows apart.
For querying and indexing, you can make use of PostgreSQL's ability to compare like this: (num_likes, id) >= (4, 325698).
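
A minimal sketch of that idea follows. It uses Python's built-in sqlite3 module only so that it runs without a server (SQLite 3.15 or later accepts the same row-value comparison syntax as PostgreSQL). The sample rows and the helper name fetch_page are assumptions. Because the sort is descending, the comparison for "the next page" is < rather than >=.

```python
import sqlite3


def fetch_page(conn, page_size, after=None):
    """Return the next page ordered by (num_likes DESC, id DESC).

    `after` is the (num_likes, id) pair of the last row on the previous page.
    """
    if after is None:
        sql = ("SELECT id, num_likes FROM posts "
               "ORDER BY num_likes DESC, id DESC LIMIT ?")
        params = (page_size,)
    else:
        # Row-value comparison: only rows strictly after the last row seen.
        sql = ("SELECT id, num_likes FROM posts "
               "WHERE (num_likes, id) < (?, ?) "
               "ORDER BY num_likes DESC, id DESC LIMIT ?")
        params = (after[0], after[1], page_size)
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, num_likes INTEGER NOT NULL)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(1, 5), (2, 4), (3, 4), (4, 4), (5, 2)])

page1 = fetch_page(conn, 2)
# A new row with a tied num_likes arrives between the two requests.
conn.execute("INSERT INTO posts VALUES (6, 4)")
last_id, last_likes = page1[-1]
page2 = fetch_page(conn, 2, after=(last_likes, last_id))
```

The row inserted between the two requests does not cause any row from the first page to reappear on the second, because the second query resumes from the last (num_likes, id) pair instead of from an offset.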

Tableau Union Joins - Can you un-merge automatically merged fields?

I am attempting to union join 2 tables in my Microsoft NAV data source within Tableau. However, I have two fields named "No." that do not contain the same data.
When I apply a union join, Tableau automatically merges these fields and I cannot un-merge them.
Is there a way to un-merge these fields?
Or is there a way of doing a manual union join?
I have tried renaming the field before dragging the second table into the worksheet; however, I can see that the "Remote Field Name" still remains the same.
Thanks
One approach is to let Tableau merge the fields and then use the generated fields to distinguish between them.
When you perform a Union in Tableau, it adds a few fields to your data source so you can tell which data rows came from which tables. The most useful in your case is called [Table Name]. So when you build your visualizations, you can use the [Table Name] field to know how to interpret the [No.] field.
If that is awkward, you can create 2 calculated fields to represent only those [No.] values that have the same role. For example, define [No. Type 1] as if [Table Name] = "Table 1" then [No.] end, and define [No. Type 2] similarly. Then you can hide the original [No.] field.
These new fields will only have values for the appropriate data rows, and will be null otherwise. Aggregate functions like SUM(), AVG() etc ignore nulls, so you can use those fields as measures easily.
If you want to use a calculation in a JOIN clause, say after making a UNION, first specify the tables (or unions of tables) to join. Then click the Venn diagram to specify the join keys and open either the left or the right list of fields; the option to create or edit a join calculation is in small print at the bottom of that list.

Filter and display database audit / changelog (activity stream)

I'm developing an application with SQLAlchemy and PostgreSQL. Users of the system modify data in 8 or so tables. Consider this contrived example schema:
I want to add visible logging to the system to record what has changed, but not necessarily how it has changed. For example: "User A modified product Foo", "User A added user B" or "User C purchased product Bar". So basically I want to store:
Who made the change
A message describing the change
Enough information to reference the object that changed, e.g. the product_id and customer_id when an order is placed, so the user can click through to that entity
I want to show each user a list of recent and relevant changes when they log in to the application (a bit like the main timeline in Facebook etc). And I want to store subscriptions, so that users can subscribe to changes, e.g. "tell me when product X is modified", or "tell me when any products in store S are modified".
I have seen the audit trigger recipe, but I'm not sure it's what I want. That audit trigger might do a good job of recording changes, but how can I quickly filter it to show recent, relevant changes to the user? Options that I'm considering:
Have one column per ID type in the log and subscription tables, with an index on each column
Use full text search, combining the ID types as a tsvector
Use an hstore or json column for the IDs, and index the contents somehow
Store references as URIs (strings) without an index, and walk over the logs in reverse date order, using application logic to filter by URI
Any insights appreciated :)
Edit: It seems what I'm talking about is an activity stream. The suggestion in this answer to filter by time first is sounding pretty good.
Since the objects all use uuid for the id field, I think I'll create the activity table like this:
Have a generic reference to the target object, with a uuid column with no foreign key, and an enum column specifying the type of object it refers to.
Have an array column that stores generic uuids (maybe as text[]) of the target object and its parents (e.g. parent categories, store and organisation), and search the array for matching subscriptions. That way a subscription for a parent category can match a child in one step (denormalised).
Put a btree index on the date column, and (maybe) a GIN index on the array UUID column.
I'll probably filter by time first to reduce the amount of searching required. Later, if needed, I'll look at using GIN to index the array column (this partially answers my question "Is there a trick for indexing an hstore in a flexible way?")
Update: this is working well. The SQL to fetch a timeline looks something like this:
    SELECT *
    FROM (
        SELECT DISTINCT ON (activity.created, activity.id)
            *
        FROM activity
        LEFT OUTER JOIN unnest(activity.object_ref) WITH ORDINALITY AS act_ref
            ON true
        LEFT OUTER JOIN subscription
            ON subscription.object_id = act_ref.act_ref
        WHERE activity.created BETWEEN :lower_date AND :upper_date
            AND subscription.user_id = :user_id
        ORDER BY activity.created DESC,
            activity.id,
            act_ref.ordinality DESC
    ) AS sub
    WHERE sub.subscribed = true;
Joining with unnest(...) WITH ORDINALITY, ordering by ordinality, and selecting distinct on the activity ID filters out activities that have been unsubscribed from at a deeper level. If you don't need to do that, you can avoid the unnest and the subquery and just use the array containment operator @> (both operands must be arrays, so the single object_id is wrapped in ARRAY[...]):
    SELECT *
    FROM activity
    JOIN subscription ON activity.object_ref @> ARRAY[subscription.object_id]
    WHERE subscription.user_id = :user_id
        AND activity.created BETWEEN :lower_date AND :upper_date
    ORDER BY activity.created DESC;
You could also join with the other object tables to get the object titles - but instead, I decided to add a title column to the activity table. This is denormalised, but it doesn't require a complex join with many tables, and it tolerates objects being deleted (which might be the action that triggered the activity logging).
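
To make the "deepest subscription wins" behaviour of the first query easier to follow, here is a rough in-memory Python sketch of the same selection logic. It is not the SQLAlchemy code from the application. It assumes object_ref is ordered from the outermost parent to the target object itself, and it uses short strings where the real tables hold UUIDs. The sample data and the function name timeline are hypothetical.

```python
from datetime import datetime


def timeline(activities, subscriptions, user_id, lower, upper):
    """Return the activities the user should see, newest first.

    For each activity, the deepest entry of object_ref for which the user
    has a subscription row decides the outcome (subscribed True or False).
    """
    # object_id -> subscribed flag, for this user only.
    user_subs = {s["object_id"]: s["subscribed"]
                 for s in subscriptions if s["user_id"] == user_id}
    visible = []
    for act in activities:
        if not (lower <= act["created"] <= upper):
            continue
        # Walk from the deepest reference outwards (ORDER BY ordinality DESC).
        for ref in reversed(act["object_ref"]):
            if ref in user_subs:
                if user_subs[ref]:
                    visible.append(act)
                break  # the first hit decides, like DISTINCT ON
    return sorted(visible, key=lambda a: a["created"], reverse=True)


activities = [
    {"id": "a1", "created": datetime(2024, 1, 2),
     "object_ref": ["store-S", "cat-C", "prod-X"]},
    {"id": "a2", "created": datetime(2024, 1, 3),
     "object_ref": ["store-S", "cat-C", "prod-Y"]},
]
subscriptions = [
    {"user_id": "u1", "object_id": "store-S", "subscribed": True},
    # The user opted out of one product further down the hierarchy.
    {"user_id": "u1", "object_id": "prod-Y", "subscribed": False},
]

result = timeline(activities, subscriptions, "u1",
                  datetime(2024, 1, 1), datetime(2024, 1, 31))
```

In this sample the user follows the whole store but has unsubscribed from prod-Y, so only the activity on prod-X is returned.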

Swap the order of items in a SQLite database

I retrieve an ordered list of items from a table in a SQLite database. How can I swap the ids of two items so that their order in the table is swapped?
The id shouldn't determine position or ordering. It should be an immutable identifier.
If you need to represent order in a database, create a separate orderNumber column. Two options are (1) values that span a range or (2) a pointer to the next item (like a linked list).
For ranges: Spanning a range helps you avoid rewriting the orderNumber column for all items after the insert point. For example, the first insert gets 1, the second gets the maximum of the range, and a third inserted between them gets the mid-range number. If you reposition an item, you assign it the midpoint of the two items it now sits between. One downside is that if the list gets enough churn you may have to rebalance the ranges (a large span minimizes this). The upside is that you can get the ordered list just by ordering by this column in the SQL statement.
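
As a rough illustration of the range approach, the sketch below always assigns the midpoint between the two neighbours, which is a small variation on the "1, then max, then mid-range" scheme described above. The span limits and the function names are assumptions.

```python
# Hypothetical span; a larger span means rebalancing is needed less often.
RANGE_MIN, RANGE_MAX = 0, 2**31


def order_number_between(before=None, after=None):
    """Pick an orderNumber between two neighbours (None means no neighbour)."""
    lo = RANGE_MIN if before is None else before
    hi = RANGE_MAX if after is None else after
    if hi - lo < 2:
        raise ValueError("no gap left between neighbours; rebalance the range")
    return (lo + hi) // 2


def evenly_spaced(count):
    """Return `count` fresh order numbers spread evenly over the span."""
    step = (RANGE_MAX - RANGE_MIN) // (count + 1)
    return [RANGE_MIN + step * (i + 1) for i in range(count)]


first = order_number_between()                           # only item so far
second = order_number_between(before=first)              # appended at the end
third = order_number_between(before=first, after=second) # slotted in between
```

When order_number_between raises, the caller would reassign the whole list with evenly_spaced(len(items)) in its current order, which is the rebalancing step mentioned above.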
For linked list: If the database has a next column that points to the id that's after it in order, you need to update a couple rows to insert something. Upside is it's simple. Downside is you can't order in the sql statement - you're relying on the code getting the list to sort it.
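
A minimal sketch of what "relying on the code to sort it" can look like, assuming each row is fetched as an (id, next_id) pair and that the last item has a NULL (None) next pointer. The function name order_by_next is hypothetical.

```python
def order_by_next(rows):
    """Rebuild list order from (id, next_id) pairs; next_id is None at the tail."""
    next_of = dict(rows)
    # The head is the only id that no other row points to.
    pointed_to = {n for n in next_of.values() if n is not None}
    heads = [i for i in next_of if i not in pointed_to]
    if len(heads) != 1:
        raise ValueError("expected exactly one head item")
    ordered, current = [], heads[0]
    while current is not None:
        if len(ordered) >= len(next_of):
            raise ValueError("cycle detected in next pointers")
        ordered.append(current)
        current = next_of[current]
    return ordered


# Rows can come back from the database in any order.
ordered_ids = order_by_next([(3, None), (1, 2), (2, 3)])
```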
One other variation is you could pull the ordered list data out of that table altogether. For example, you could have an ordered list table that has listid, itemid, orderedNumber. That allows you to have one or multiple logical ordered lists of the items in that table it references.
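
Here is a sketch of that separate-table variation using Python's sqlite3 module, including the swap the question asks about. The table names items and ordered_list, the sample rows and the helper functions are assumptions; only the listid, itemid and orderedNumber columns come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE ordered_list (
        listid        INTEGER NOT NULL,
        itemid        INTEGER NOT NULL REFERENCES items(id),
        orderedNumber INTEGER NOT NULL,
        PRIMARY KEY (listid, itemid)
    );
    INSERT INTO items VALUES (1, 'apple'), (2, 'banana'), (3, 'cherry');
    INSERT INTO ordered_list VALUES (1, 1, 10), (1, 2, 20), (1, 3, 30);
""")


def swap_positions(conn, listid, item_a, item_b):
    """Exchange the orderedNumber of two items; the item ids never change."""
    with conn:  # run the read and both updates in one transaction
        numbers = dict(conn.execute(
            "SELECT itemid, orderedNumber FROM ordered_list "
            "WHERE listid = ? AND itemid IN (?, ?)",
            (listid, item_a, item_b)))
        update = ("UPDATE ordered_list SET orderedNumber = ? "
                  "WHERE listid = ? AND itemid = ?")
        conn.execute(update, (numbers[item_b], listid, item_a))
        conn.execute(update, (numbers[item_a], listid, item_b))


def ordered_names(conn, listid):
    """Return item names in list order."""
    rows = conn.execute(
        "SELECT items.name FROM ordered_list "
        "JOIN items ON items.id = ordered_list.itemid "
        "WHERE ordered_list.listid = ? "
        "ORDER BY ordered_list.orderedNumber", (listid,))
    return [name for (name,) in rows]


before_swap = ordered_names(conn, 1)
swap_positions(conn, 1, 1, 3)
after_swap = ordered_names(conn, 1)
```

Swapping only touches two rows of the ordering table, and the ids in the items table stay immutable.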
Some other references:
How to store ordered items which often change position in DB
Best way to save a ordered List to the Database while keeping the ordering
https://dba.stackexchange.com/questions/5683/how-to-design-a-database-for-storing-a-sorted-list