I want to be able to add up to 10 tags to a record in a database (MongoDB) but I don't want to add 10 columns with the related indexes on each of them. So I thought I'd add the unique sum of these tags.
e.g. (with 6 tags)
| value | tag |
|-------|-----|
| 1     | a   |
| 2     | b   |
| 4     | c   |
| 8     | d   |
| 16    | e   |
| 32    | f   |
e.g.
a + b = 3
b + c + d = 14
I then store just the sum of the values in Mongo.
These combinations are always unique, and I can "reconstitute" them back into tags when pulled from persistent storage using iteration:
int tagSum = storedSum; // the sum read back from Mongo
foreach (var tag in tagCollection.OrderByDescending(t => (int)t))
{
    if (tagSum >= (int)tag)
    {
        TagProperty.Add(tag);
        tagSum -= (int)tag;
    }
}
My problem, however, is that I thought there must be a mathematical formula I could use to query for a specific tag, e.g. find the "c" tag by passing in the value 4. Either I'm wrong or I cannot find it.
I'm happy to go with the Multikeys solution in Mongo, but I have a lot of other data to index and using 1 index instead of 10 would just be nicer.
Multikeys is the right approach to this problem as it can find any document from the single index on the array without a table-scan. In your case you can just put the appropriate selection of letters representing the tags into the array: ["a", "d", "e"].
In more complicated cases where each field could contain the same tag values, for example song names, album names, artist names, ... I sometimes add the tag phrase twice: once on its own and once with the field name pre-pended, e.g. "artist:Hello". Now I can search on the tag word occurring in any field OR the tag word occurring in a specific field and either way it will use the index to find matching records.
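As a rough illustration of that convention (a plain Scala sketch; the field names here are only examples), the tag array for one document could be assembled like this:
// Sketch only: build the multikey tags array for a single document,
// storing each tag word both bare and with its field name pre-pended.
def tagsFor(artist: String, album: String): Seq[String] =
  Seq(artist, s"artist:$artist", album, s"album:$album")
// tagsFor("Hello", "Goodbye") == Seq("Hello", "artist:Hello", "Goodbye", "album:Goodbye")
// A query for "Hello" matches the word in any field; a query for "artist:Hello"
// matches it only in the artist field, and both use the same multikey index.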
Convert the tags a-f to an integer 0-5, call it tagValue, then the number you want is 1<<tagValue.
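For example, a minimal Scala sketch of that arithmetic (the letter-to-position mapping is assumed):
// Map each tag letter to a bit position, so its value is a power of two.
val tagValue = Map('a' -> 0, 'b' -> 1, 'c' -> 2, 'd' -> 3, 'e' -> 4, 'f' -> 5)
def valueOf(tag: Char): Int = 1 << tagValue(tag)            // 'c' -> 4
// Encoding a set of tags is the sum (bitwise OR) of their values,
// and testing for one tag is a bitwise AND instead of the decoding loop.
def encode(tags: Seq[Char]): Int = tags.map(valueOf).sum    // encode(Seq('b', 'c', 'd')) == 14
def hasTag(sum: Int, tag: Char): Boolean = (sum & valueOf(tag)) != 0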
I would like to delete or update values in a nested dictionary.
E.g.
d:`date`tab`col!((2022.12.01;2022.12.03);`TRADE`SYM;`ID`CODE`PIN`NAME)
I would like to update `PIN to `Yen, or maybe delete `PIN and `CODE from the dictionary.
I think this may be slightly fiddly due to the nested nature, but replacing values can be done with a dictionary and fill (^). This would replace all instances of `PIN if there were multiple.
#[d;`col;{x^(enlist[`PIN]!enlist`YEN) x}]
date| 2022.12.01 2022.12.03
tab | `TRADE`SYM
col | `ID`CODE`YEN`NAME
Deletions could be done with except.
q)#[d;`col;except[;`PIN`CODE]]
date| 2022.12.01 2022.12.03
tab | `TRADE`SYM
col | `ID`NAME
I wouldn't be surprised to find better ways to do both these actions.
You could do something like:
q)#[d;`col;{x where not x in`CODE`PIN}]
date| 2022.12.01 2022.12.03
tab | `TRADE`SYM
col | `ID`NAME
Very minor mods to @Thomas Smyth-Treliant's answer:
#[d;`col;] {x^(.[!]1#'`PIN`Yen) x} / (1)
#[d;`col;] except[;`PIN`CODE] / (2)
You could also use amend in an implicit function to update nested values in a dictionary:
{.[d;(`col;x);:;`your`update]} where d[`col] in `ID`PIN
Output:
date| 2022.12.01 2022.12.03
tab | `TRADE`SYM
col | `your`CODE`update`NAME
I need to create a dataframe that concatenates many columns, but each element that I concatenate has to have a fixed width to follow a specific layout.
For example, I need to concatenate first name and last name, but the first name always has to have 50 characters, even if it is shorter, to follow a certain required layout. I should add spaces to make up for the missing characters in the name.
I'm using this code, but it is not giving me the desired result:
df.select(concat(rpad($"FirstName", 50, " "), rpad($"LastName", 50, " ")))
Does someone have any tips on how I can do this?
I think your solution is correct. If you are using spark-shell to check it visually, set truncate = false in the show method.
Here is the test DF data:
scala> df.show
+---------+--------+
|FirstName|LastName|
+---------+--------+
|     John|   Smith|
|    James|    Bond|
+---------+--------+
Now if we print the concatenation result and add the length as a separate column, we get exactly the sum of the two padded widths (50 + 50 = 100):
scala> df
.select(concat(rpad($"FirstName", 50, " "), rpad($"LastName", 50, " "))
.as("concat"))
.withColumn("lenght", length(col("concat")))
.show(100, false)
+----------------------------------------------------------------------------------------------------+------+
|concat                                                                                              |length|
+----------------------------------------------------------------------------------------------------+------+
|John                                              Smith                                             |100   |
|James                                             Bond                                              |100   |
+----------------------------------------------------------------------------------------------------+------+
How can I use Group By with FileMaker? It is a problem kind of similar to this: Filemaker sum by group. Can someone explain how to use a summary field with GetSummary?
| Id | Value |
|----|-------|
| 1  | 50    |
| 1  | 50    |
| 2  | 10    |
| 2  | 5     |
| 1  | 100   |
| 2  | 15    |
1. Create a new summary field (ValueSummary) for the Value field.
2. Sort the records by the Id field.
3. Use GetSummary ( ValueSummary ; Id ) to get the summary of a particular Id value.
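For the sample data above, once the records are sorted by Id, GetSummary ( ValueSummary ; Id ) should return 200 for the Id 1 records (50 + 50 + 100) and 30 for the Id 2 records (10 + 5 + 15).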
In my problem, there is a data stream of information about package delivery coming in. The data consists of "NumberOfPackages", "Action" (which can be either "Loaded", "Delivered" or "In Transit"), "TrackingId", and "Driver".
val streamingData = <filtered data frame based on "Loaded" and "Delivered" Action types only>
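(Purely for illustration, that filter might look something like the sketch below, where rawStream is a hypothetical name for the unfiltered source:)
// Hypothetical sketch of the pre-filtering step described above.
val streamingData = rawStream.where($"Action".isin("Loaded", "Delivered"))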
The goal is to look at the number of packages at the moment of loading and at the moment of delivery, and if they are not the same, execute a function that would call a REST service with the "TrackingId" as a parameter.
The data looks like this:
+----------------+---------+----------+------+
|NumberOfPackages|Action   |TrackingId|Driver|
+----------------+---------+----------+------+
|5               |Loaded   |a         |Alex  |
|5               |Delivered|a         |Alex  |
|8               |Loaded   |b         |James |
|8               |Delivered|b         |James |
|7               |Loaded   |c         |Mark  |
|3               |Delivered|c         |Mark  |
+----------------+---------+----------+------+
<...more rows in this streaming data frame...>
In this case, we see that for the "TrackingId" equal to "c", the number of packages loaded and delivered isn't the same, so this is where we'd need to call the REST API with the "TrackingId".
I would like to combine rows based on "TrackingId", which will always be unique for each trip. If we get the rows combined based on this tracking id, we could have two columns for number of packages, something like "PackagesAtLoadTime" and "PackagesAtDeliveryTime". Then we could compare these two values for each row and filter the dataframe by those which are not equal.
So far I have tried the groupByKey method with the "TrackingId", but I couldn't find a similar example and my experimental attempts weren't successful.
After I figure out how to "merge" the two rows with the same tracking id together and have a column for each corresponding count of packages, I could define a UDF:
def notEqualPackages = udf((packagesLoaded: Int, packagesDelivered: Int) => packagesLoaded!=packagesDelivered)
And use it to filter the rows of the dataframe to contain only those with not matching numbers:
streamingData.where(notEqualPackages(streamingData("packagesLoaded"), streamingData("packagesDelivered")))
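For illustration, the kind of merge by "TrackingId" I have in mind would look roughly like this conditional aggregation (an untested sketch; it assumes the spark session implicits are imported and that the streaming output mode supports the aggregation):
import org.apache.spark.sql.functions._
// Rough sketch: collapse the Loaded/Delivered pair per TrackingId into two columns,
// then keep only the trips where the two counts differ.
val mismatched = streamingData
  .groupBy($"TrackingId")
  .agg(
    max(when($"Action" === "Loaded", $"NumberOfPackages")).as("PackagesAtLoadTime"),
    max(when($"Action" === "Delivered", $"NumberOfPackages")).as("PackagesAtDeliveryTime")
  )
  .where($"PackagesAtLoadTime" =!= $"PackagesAtDeliveryTime")
Each remaining row would then carry a "TrackingId" that needs the REST call.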
I have a very simple job with 4 'contact' records, of which 2 have an identical email address.
Now I try to find the records that have an identical email address. So I load the contact records twice, attach both to a tMap, and use a lookup to match on email address. Using filter expressions, I ensure that I don't compare records with themselves.
The result now is that only 1 of the duplicate emails is marked as 'duplicate' and the other record is NOT matched. Does anybody have an idea why?
This is because:
The Unique match option functions as a Last match. The First match and All matches options function as named.
So if we remove the input filter row1.id!=row2.id and just left join the 2 flows and show them, we will get:
|=-+------------------+----+-----------------=|
|id|mail |id_1|mail_1 |
|=-+------------------+----+-----------------=|
|c1|some#mail.com |c1 |some#mail.com |
|c2|other#mail.com |c2 |other#mail.com |
|c3|identical#mail.com|c4 |identical#mail.com|
|c4|identical#mail.com|c4 |identical#mail.com|
'--+------------------+----+------------------'
Note that the last 2 rows of the lookup flow do not contain row c3, because Talend fetched the last row that matches identical#mail.com, which is c4.
Now if we filter that by row1.id!=row2.id, we get only the third row marked as duplicated, which is what you got:
|=-+------------------+-----------=|
|id|mail |isDuplicated|
|=-+------------------+-----------=|
|c1|some#mail.com |false |
|c2|other#mail.com |false |
|c3|identical#mail.com|true |
|c4|identical#mail.com|false |
'--+------------------+------------'
What we can do using only one tMap is to obtain all unique mail rows and all occurrences of the duplicated rows by enabling the All matches option:
|=-+------------------+----=|
|id|mail |isDup|
|=-+------------------+----=|
|c1|some#mail.com |false|
|c2|other#mail.com |false|
|c3|identical#mail.com|false|
|c3|identical#mail.com|true |
|c4|identical#mail.com|true |
|c4|identical#mail.com|false|
'--+------------------+-----'
Then we can filter this output to get the duplicated rows in addition to the initial flow; to fulfil your exact requirement, I don't think we are obliged to join this output again, like this:
To get this output:
.--+------------------.
| unique |
|=-+-----------------=|
|id|mail |
|=-+-----------------=|
|c1|some#mail.com |
|c2|other#mail.com |
|c3|identical#mail.com|
|c4|identical#mail.com|
'--+------------------'
.--+------------------.
| duplicated |
|=-+-----------------=|
|id|mail |
|=-+-----------------=|
|c3|identical#mail.com|
|c4|identical#mail.com|
'--+------------------'
.--+------------------+------------.
| isDuplicated |
|=-+------------------+-----------=|
|id|mail |isDuplicated|
|=-+------------------+-----------=|
|c1|some#mail.com |false |
|c2|other#mail.com |false |
|c3|identical#mail.com|true |
|c4|identical#mail.com|true |
'--+------------------+------------'