Scala Nested Iteration within RDD

I have to iterate through the whole data set to find similar values for a single column. For example:
ID,FN,LN,Phone
--------------
1,James,Butt,872-232-1212
2,Josephine,Darakjy,872-232-1213
3,Art,Venere,872-232-1214
4,Lenna,Paprocki,872-232-1215
5,Donette,Foller,872-232-1216
6,Jmes,Butt,666-232-1212
7,Donette,Foller,888-232-1216
8,Josphne,Darkjy,555-232-1213
Inside the loop I take FN, which is 'James', and check whether a similar name exists in the complete data set using some kind of string distance (e.g. Levenshtein). In this case there is a match with ID 6, 'Jmes', so I create a bucket by adding a new GUID column, like this:
ID,FN,LN,Phone,GroupId
----------------------
1,James,Butt,872-232-1212,G1
2,Josephine,Darakjy,872-232-1213,G2
3,Art,Venere,872-232-1214,G3
4,Lenna,Paprocki,872-232-1215,G4
5,Donette,Foller,872-232-1216,G5
6,Jmes,Butt,666-232-1212,G1
7,Donette,Foller,888-232-1216,G5
8,Josphne,Darkjy,555-232-1213,G2
I have to do the same operation on multiple columns, such as LN and Phone, as well. Imagine if I have 1 million records.
Any thoughts, suggestions or links are appreciated. Thank you!

I would definitely not try anything pairwise, and would rather think about coding a per-field Levenshtein-style index and accumulating results on the fly. I'd probably start from something suffix-tree-ish.
Will try to sketch a prototype as soon as I get to the laptop...
Update: after some reading I am leaning towards Affinity Clustering combined with pairwise (yes, I know) Levenshtein distances cached on a trie. Code in progress...
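In the meantime, here is a minimal sketch of the pairwise idea in plain Scala (not the promised index-based version): it computes Levenshtein distance with dynamic programming and assigns each record the group of the first previously seen value within a small threshold. The Person case class, the groupByField helper, and the threshold of 2 are illustrative assumptions, and the O(n^2) scan is exactly what would not scale to a million records.

case class Person(id: Int, fn: String, ln: String, phone: String)

object FuzzyGrouping {
  // Classic dynamic-programming Levenshtein distance.
  def levenshtein(a: String, b: String): Int = {
    val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
      if (i == 0) j else if (j == 0) i else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      d(i)(j) = math.min(math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1),
                         d(i - 1)(j - 1) + cost)
    }
    d(a.length)(b.length)
  }

  // Assign each record the group of the first seen value within maxDist,
  // or a fresh group id if nothing matches. O(n^2): a sketch, not a
  // solution for a million records.
  def groupByField(people: Seq[Person], field: Person => String,
                   maxDist: Int = 2): Map[Int, String] = {
    val groups = scala.collection.mutable.LinkedHashMap.empty[String, String]
    people.map { p =>
      val key = groups.keys.find(k => levenshtein(k, field(p)) <= maxDist)
                           .getOrElse(field(p))
      p.id -> groups.getOrElseUpdate(key, "G" + (groups.size + 1))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val people = Seq(Person(1, "James", "Butt", "872-232-1212"),
                     Person(6, "Jmes", "Butt", "666-232-1212"))
    println(groupByField(people, _.fn)) // Map(1 -> G1, 6 -> G1)
  }
}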

Related

Maintain N latest points in Postgres

My goal is to keep the N latest points per user in one Postgres row; when a new point is added, the oldest point should be removed.
For example, say a row holds one userid and N integers.
When a new integer comes in for a user, I want to remove the oldest entry and add the new one. The issue, obviously, is performance: since I am only adding one new integer, I want some way to do it fast.
I tried one very naive way, keeping two columns,
userid, json
where json held the list of all integers. I would remove the first entry, append the new entry, and write the whole JSON back to Postgres (see the sketch below). Unsurprisingly, it does not perform well.
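To make the naive approach concrete, here is a minimal sketch of that read-modify-write cycle over JDBC, in Scala. The table points(userid, json), the connection URL, and N = 5 are illustrative assumptions; the point is that every insert rewrites the whole list.

import java.sql.DriverManager

object NaiveLatestN {
  val N = 5 // keep only the N most recent integers per user (assumption)

  def addPoint(userId: Int, value: Int): Unit = {
    val conn = DriverManager.getConnection("jdbc:postgresql://localhost/test")
    try {
      // 1. Read the whole list back into the application.
      val sel = conn.prepareStatement("SELECT json FROM points WHERE userid = ?")
      sel.setInt(1, userId)
      val rs = sel.executeQuery()
      val current =
        if (rs.next())
          rs.getString(1).stripPrefix("[").stripSuffix("]")
            .split(",").filter(_.nonEmpty).map(_.trim.toInt).toVector
        else Vector.empty[Int]

      // 2. Drop the oldest entry, append the new one.
      val updated = (current :+ value).takeRight(N)

      // 3. Rewrite the entire list (upsert requires Postgres 9.5+) --
      //    this full rewrite on every insert is why it performs poorly.
      val upd = conn.prepareStatement(
        "INSERT INTO points (userid, json) VALUES (?, ?) " +
        "ON CONFLICT (userid) DO UPDATE SET json = EXCLUDED.json")
      upd.setInt(1, userId)
      upd.setString(2, updated.mkString("[", ",", "]"))
      upd.executeUpdate()
    } finally conn.close()
  }
}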
Please suggest a good way to do this. Does Postgres have some min-heap-like data structure that can do it in better than linear time?

Gremlin query to find the count of a label for all the nodes

Sample query:
The following query returns the count of nodes with the label "Asset" under a particular id (0):
g.V().hasId(0).repeat(out()).emit().hasLabel('Asset').count()
But I need to find that count for all the nodes present in the graph, with the same condition as above.
I am able to do it node by node, but my requirement is to get, for every node, the count of descendants with that label, say 'Asset'.
So I am expecting something like
{v[0]: 2}
{v[1]: 1}
{v[2]: 1}
where v[1] and v[2] each have one node labelled "Asset" under them, making the overall count for v[0] equal to 2.
There are a few ways you could do it. It's maybe a little weird, but you could use group():
g.V().
  group().
    by().
    by(repeat(out()).emit().hasLabel('Asset').count())
or you could do it with select() and then you don't build a big Map in memory:
g.V().as('v').
  map(repeat(out()).emit().hasLabel('Asset').count()).as('count').
  select('v','count')
if you want to maintain hierarchy you could use tree():
g.V(0).
  repeat(out()).emit().
  tree().
    by(project('v','count').
         by().
         by(repeat(out()).emit().hasLabel('Asset').count()).
       select(values))
Basically you get a tree from vertex 0 and then apply a project() over it to build that structure per vertex in the tree. I had a different way to do it using union(), but I found a possible bug and had to come up with a different method (actually the Gremlin guru, Daniel Kuppitz, came up with the above approach). I think the use of project() is more natural and readable, so it is definitely the better way. Of course, as Mr. Kuppitz pointed out, with project() you create an unnecessary Map (which you then immediately get rid of with select(values)); the use of union() would be better in that sense.

Transpose data using Talend

I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should indeed work, but I did not try it.
However, I have another solution which doesn't require changing the input format and is not too complicated to implement: the job needs only two transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation:
Your input is read from a CSV file (it could be a database or any other kind of input).
The tDenormalize component denormalizes your value column (column 2), grouping on the id column (column 1) and separating fields with a specific delimiter (";" in my case), which collapses the data to 2 rows.
tMap: split the aggregated column back into multiple columns using Java's String.split() method and spread the resulting array across the output columns (see the sketch after this explanation). The tMap should look like this:
Since Talend doesn't accept storing Array objects, make sure to store the split String as an Object, then cast that Object back into an array on the right side of the map.
That approach should give you the expected result.
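For reference, the split-and-spread step inside the tMap boils down to logic like the following (shown here as a standalone Scala sketch; the ";" delimiter and six values per id mirror the description above and are assumptions):

object SplitAggregated {
  def main(args: Array[String]): Unit = {
    // One aggregated, delimiter-joined string per id, as produced by tDenormalize.
    val denormalized = Seq("1" -> "a;b;c;d;e;f",
                           "2" -> "g;h;i;j;k;l")

    denormalized.foreach { case (id, joined) =>
      val cols = joined.split(";") // the String.split() call done in the tMap
      println((id +: cols).mkString(",")) // id plus the six output columns
    }
  }
}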
IMPORTANT:
tDenormalize might shuffle the rows, meaning that for bigger inputs you might encounter unsorted output. Make sure to sort it if needed, or use tDenormalizeSortedRow instead.
tDenormalize behaves like an aggregation component, meaning it scans the whole input before processing, which can cause performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with id 1 and 6 entries with id 2). Since 6 columns are expected, you should always have 6 lines per id. If that cannot be guaranteed, you should implement dbh's solution instead, and you probably HAVE TO add a key column.
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name,
like "Identifier, field name, value".
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file back with tFileInputDelimited.
See docs and an example at
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited

Tableau: Create a table calculation that sums distinct string values (names) when condition is met

I am getting my data from a denormalized table where I keep names and actions (among other things). I want to create a calculated field that returns the count of distinct workgroup names, but only when there are more than five actions present in the DB for a given workgroup.
Here's how I did it when I wanted to check whether a certain action had been registered for a workgroup:
WINDOW_SUM(COUNTD(IF [action] = "ADD" THEN [workgroup_name] END))
When I try to do a similar thing with a count, I get "Cannot mix aggregate and non-aggregate arguments":
WINDOW_SUM(COUNTD(IF COUNT([Number of Records]) > 5 THEN [workgroup_name] END))
I know the problem is in the IF clause, but I don't know how to fix it.
How can I change the IF to be valid? Or maybe there's an easier way to do this that I am missing?
EDIT:
(after Inox's response)
I know that my problem is mixing aggregate with non-aggregate fields. I can't use a filter to do it, because I want to use this later as part of a more complicated view - filtering would destroy the whole idea.
No, the problem is mixing aggregated arguments (e.g., sum, count) with non-aggregated ones (e.g., any field used directly). And that's what you're doing by mixing COUNT([Number of Records]) with [workgroup_name].
If your goal is to know how many unique workgroup_name values have more than 5 records (which is what your code suggests), I think it's easier to filter, then count.
So first drag workgroup_name to Filters, go to the Condition tab, and select By field: Number of Records, Count, >, 5.
This way you'll keep only the workgroup_name values that have more than 5 records.
Now you can go with a simple COUNTD([workgroup_name]).
EDIT: After clarification
Okay, then you need a marker that is fixed in your database, so table calculations won't help you.
By definition, a table calculation depends on the fields that are on the worksheet (and on how you decide to use those fields for partitioning or addressing), and it is only calculated AFTER being used in a sheet. That way, each time you call the function it recalculates, and for some analyses you may want to do, the fields you need to make the table calculation correct won't be there.
The same applies to aggregations (counts, sums, ...): the aggregation depends on the level of aggregation you have.
In this case it's better to manipulate your data before connecting it to Tableau. I don't see a direct way (a single calculated field) that would solve your problem. What can be done is to generate a data set from Tableau (with the aggregated number of records for each workgroup_name), export it to csv or mdb, and then reconnect it to Tableau. But if you can manipulate your database outside Tableau, that's usually a better solution.
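If you do preprocess outside Tableau, the fixed marker is just the per-workgroup record count joined back onto every row. A minimal sketch of that step (plain Scala over CSV; the file names and the position of workgroup_name in column 0 are assumptions):

import scala.io.Source
import java.io.PrintWriter

object TagWorkgroupCounts {
  def main(args: Array[String]): Unit = {
    val lines = Source.fromFile("actions.csv").getLines().toList
    val header = lines.head
    val rows = lines.tail
    val workgroupOf = (row: String) => row.split(",")(0) // workgroup_name column

    // Count records per workgroup once, up front.
    val counts = rows.groupBy(workgroupOf).map { case (wg, rs) => wg -> rs.size }

    // Append the count to every row; in Tableau, filter on
    // workgroup_records > 5 and then use a plain COUNTD([workgroup_name]).
    val out = new PrintWriter("actions_tagged.csv")
    out.println(header + ",workgroup_records")
    rows.foreach(r => out.println(r + "," + counts(workgroupOf(r))))
    out.close()
  }
}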

Cassandra CompositeType as row key Validator

I'm working on a POC.
I have a column family which stores server events. To avoid oversized rows, we split each logical row into N rows, using a CompositeType row key:
CREATE COLUMN FAMILY logs with comparator='ReversedType(TimeUUIDType)' and key_validation_class='CompositeType(UTF8Type,IntegerType)' and default_validation_class=UTF8Type;
So for each server name we have N rows, and we write data to them using a very simple round-robin algorithm.
I have no problem writing data to any row:
Mutator<Composite> mutator =
    HFactory.createMutator(keySpace, CompositeSerializer.get());
HColumn<UUID, String> col =
    HFactory.createColumn(TimeUUIDUtils.getUniqueTimeUUIDinMillis(), log);
Composite rowName = new Composite();
rowName.addComponent(serverName, StringSerializer.get());
rowName.addComponent(this.roundRobinDestributor.getRow(), IntegerSerializer.get());
mutator.insert(rowName, columnFamilyName, col);
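For context, the roundRobinDestributor above is the question's own helper, not a Hector class; a hypothetical implementation could be as simple as an atomic counter modulo N, sketched here in Scala:

import java.util.concurrent.atomic.AtomicLong

// Hypothetical sketch of the round-robin distributor referenced above:
// successive calls cycle through the N physical row indices 0 .. N-1.
class RoundRobinDistributor(n: Int) {
  private val counter = new AtomicLong(0)
  def getRow: Int = (counter.getAndIncrement() % n).toInt
}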
So far so good, but now I have two questions:
1) Given that getting all logs for some serverName means scanning row keys, should I use ByteOrderedPartitioner?
2) Can anybody help me, or point me to some material, on how to create a Hector query that brings back all rows for server1 ({server1:0}, {server1:1}, {server1:2}, etc.)? I have seen a lot of examples using CompositeType as a comparator, but none for a key validator.
Any help or comment is highly appreciated.
First of all, row size shouldn't be a problem in Cassandra. Even so, it may be worth splitting rows, since data distribution across the cluster will be more even that way.
ByteOrderedPartitioner doesn't look like a good option here: it would be hard to achieve a uniform distribution of rows across the cluster, which would lead to hotspots.
There's no way to query a range of keys when using RandomPartitioner. However, if the maximum N is reasonably small (up to 256), a MultigetSliceQuery can be used to fetch the whole set of rows, since all N keys for a given server can simply be enumerated.
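A minimal sketch of that multiget, written in Scala against Hector's Java API; the keyspace, column family name, and N are assumptions carried over from the question:

import me.prettyprint.cassandra.serializers.{CompositeSerializer, IntegerSerializer, StringSerializer, UUIDSerializer}
import me.prettyprint.hector.api.Keyspace
import me.prettyprint.hector.api.beans.Composite
import me.prettyprint.hector.api.factory.HFactory
import scala.jdk.CollectionConverters._

object FetchServerLogs {
  val N = 16 // number of physical rows per server (assumption)

  // Fetch the columns of all N rows {serverName:0} .. {serverName:N-1}.
  def allLogsFor(keySpace: Keyspace, columnFamilyName: String,
                 serverName: String): Seq[String] = {
    val keys = (0 until N).map { i =>
      val key = new Composite()
      key.addComponent(serverName, StringSerializer.get())
      key.addComponent(Int.box(i), IntegerSerializer.get())
      key
    }

    // Column names are TimeUUIDs (per the comparator), values are the log strings.
    val query = HFactory.createMultigetSliceQuery(
        keySpace, CompositeSerializer.get(),
        UUIDSerializer.get(), StringSerializer.get())
    query.setColumnFamily(columnFamilyName)
    query.setKeys(keys: _*)
    query.setRange(null, null, false, Int.MaxValue) // all columns of each row

    query.execute().get().asScala.toSeq
      .flatMap(_.getColumnSlice.getColumns.asScala.map(_.getValue))
  }
}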