FileMaker database design with calculated fields and filtering

I am trying out FileMaker Pro 12 right now with no previous FM experience, although I have basic experience with other databases. The issue I'm running into is doing filtered queries for a report that spans a one-to-many relationship. Here is an example:
The two tables:

Sample_Replicate
    PK
    Sample FK
    other fields

Weights
    Sample_Replicate_FK (linked to PK of Sample_Replicate)
    Weight
    Measurement type (tare, gross, dry, ash)
    Wash type (null or from list of lab assays)
I want to create a report that displays: (gross - tare), (dry - tare)/(gross - tare), (ash - tare)/(gross - tare), and (dry - tare)/(gross - tare) for all dry weights with non-null wash types.
It seems that FM wants me to create columns for each of these values (which is doable, as the list of lab assays changes minimally and updating the database would be acceptable, though not preferred). I have tried adding a gross wt, tare wt, etc. to the Sample_Replicate table, but they only ever evaluate the first related record (the tare wt) when defined as calculated fields like this:
tare wt field  = Case ( Weights::Measurement type = "Tare" ; Weights::Weight )
gross wt field = Case ( Weights::Measurement type = "Gross" ; Weights::Weight )
etc...
It also seems to fail when I add the criterion:
and IsEmpty ( Weights::Wash type )
Could someone point me in the right direction on this issue? Thanks.
EDIT:
I came across this: http://www.filemakertoday.com/com/showthread.php/14084-Calculation-based-on-1-to-many-relationship
It seems that I could create ~15 calculated fields on the Weights table, one for each combination of measurement and wash type, then sum each of these columns in Sample_Replicate after adding the corresponding 15 fields there. This seems absolutely asinine. Isn't there a better way to filter the results of a one-to-many relationship in FM?

What about the following structure:

Replicate
    ID

Wash Weight
    Replicate ID
    Type (null or from list of lab assays)
    Tare
    Gross
    Dry
    Ash
    + calculated fields
I assume you only calculate weight ratios of the same wash type. The weight types (tare, gross, etc.) are not just labels here; since you use them in formulas in specific places, they are more like roles, so I think they deserve their own fields.
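With that structure, the ratios become ordinary single-record calculation fields on Wash Weight, with no cross-record filtering needed. A sketch only; these field names are made up:

DryFraction = ( Dry - Tare ) / ( Gross - Tare )
AshFraction = ( Ash - Tare ) / ( Gross - Tare )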

Add the tare wt field, etc. in the Weights table, but then add a calc field in your Sample_Replicate table to get the sum of all related values.
For example: add a field "total tare wt" defined as Sum ( Weights::tare wt ).
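If one helper field per measurement type feels like too much overhead, FileMaker 12's ExecuteSQL() function can do the filtering in a single expression instead. A sketch only, assuming the table occurrence and field names from the question (here the dry weights restricted to non-null wash types):

ExecuteSQL (
    "SELECT SUM ( Weight ) " &
    "FROM Weights " &
    "WHERE Sample_Replicate_FK = ? " &
    "AND \"Measurement type\" = 'Dry' " &
    "AND \"Wash type\" IS NOT NULL" ;
    "" ; "" ; Sample_Replicate::PK
)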

Related

How can I best construct data structures to retrieve similar values for demographic matching?

The job is person demographic matching/consolidation.
I have incoming person demographic information, and I need to determine whether it matches an existing person in a dataset. I get the following data:
NAME_LAST VARCHAR2(40),
NAME_FIRST VARCHAR2(40),
NAME_MIDDLE VARCHAR2(40),
NAME_MAIDEN VARCHAR2(40),
RESIDENCE_ADDRESS VARCHAR2(60),
RESIDENCE_CITY VARCHAR2(50),
RESIDENCE_STATE VARCHAR2(2),
RESIDENCE_ZIP VARCHAR2(9),
RACE VARCHAR2(2),
DATE_OF_BIRTH DATE,
GENDER VARCHAR2(1),
TELEPHONE VARCHAR2(10),
SSN VARCHAR2(9)
The incoming and existing data can and does have typographic errors in any/all fields. I have written a probabilistic algorithm which will take an existing record, incoming record and score their similarity reasonably well (99.99%+).
The problem is performance. The match of two records is reasonably quick, but the dataset I need to match against currently has over 3.9 million rows. So obviously I can't try to match against all records in the dataset.
The common way around this is to limit searches using deterministic matches against limited subsets of the data (blocking). Soundex and double-metaphone "hashing" are used on the name fields, and DOB is split into year and MMDD segments. This blocking yields good results, but unless I cast a wide net I miss some matches, and if I cast a wide net the performance degrades. A sketch of this kind of blocking query appears below.
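For concreteness, one current blocking query looks roughly like this (a sketch only; the table name person and the index name are made up; Oracle's built-in SOUNDEX can back a function-based index):

-- Function-based index so the SOUNDEX predicate is indexable
CREATE INDEX person_block_ix
    ON person (SOUNDEX(name_last), EXTRACT(YEAR FROM date_of_birth));

-- Candidate block: same-sounding last name, same birth year.
-- Only these rows get scored by the probabilistic matcher.
SELECT *
  FROM person
 WHERE SOUNDEX(name_last) = SOUNDEX(:in_name_last)
   AND EXTRACT(YEAR FROM date_of_birth) = :in_dob_year;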
So the questions are:
What types of "hashing" can I do, other than double metaphone & soundex, on the data elements which would be suitable for exact or range matching which would yield small subsets of data likely to contain the "best" match?
Is there a better approach to creating a suitable data structure for matching?
The data is contained in an Oracle 19c database, and the main language at my disposal is PL/SQL.
You should either add the algorithm that produces your similarity score, or add more information about the input you need to match against.
For example:
RESIDENCE_CITY VARCHAR2(50),
RESIDENCE_STATE VARCHAR2(2),
RESIDENCE_ZIP VARCHAR2(9)
These should either contain no errors, or contain errors that are much easier to detect and correct.
In that case you can create an index on these three columns and run your algorithm only on the rows that match them exactly (or match after correction).
So my suggestion would be to divide the original data into smaller groups that can be matched more precisely, and then run your algorithm within each smaller group.
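A sketch of that suggestion in SQL (again assuming the 3.9 million rows live in a hypothetical person table; the index name is made up):

CREATE INDEX person_residence_ix
    ON person (residence_zip, residence_state, residence_city);

-- Score only the rows that share the incoming record's residence block
SELECT *
  FROM person
 WHERE residence_zip   = :in_zip
   AND residence_state = :in_state
   AND residence_city  = :in_city;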

Aggregate on Redshift SUPER type

Context
I'm trying to find the best way to represent and aggregate a high-cardinality column in Redshift. The source is event-based and looks something like this:
user  timestamp            event_type
----  -------------------  ----------
1     2021-01-01 12:00:00  foo
1     2021-01-01 15:00:00  bar
2     2021-01-01 16:00:00  foo
2     2021-01-01 19:00:00  foo
Where:
the number of users is very large
a single user can have very large numbers of events, but is unlikely to have many different event types
the number of different event_type values is very large, and constantly growing
I want to aggregate this data into a much smaller dataset with a single record (document) per user. These documents will then be exported. The aggregations of interest are things like:
Number of events
Most recent event time
But also:
Number of events for each event_type
It is this latter case that I am finding difficult.
Solutions I've considered
The simple "columnar-DB-friendy" approach to this problem would simply be to have an aggregate column for each event type:
user  nb_events  ...  nb_foo  nb_bar
----  ---------  ---  ------  ------
1     2          ...  1       1
2     2          ...  2       0
But I don't think this is an appropriate solution here, since the event_type field is dynamic and may have hundreds or thousands of values (and Redshift has an upper limit of 1,600 columns). Moreover, there may be multiple types of aggregations on this event_type field (not just count).
A second approach would be to keep the data in its vertical form, where there is not one row per user but rather one row per (user, event_type). However, this really just postpones the issue - at some point the data still needs to be aggregated into a single record per user to achieve the target document structure, and the problem of column explosion still exists.
A much more natural (I think) representation of this data is as a sparse array/document/SUPER:
user  nb_events  ...  count_by_event_type (SUPER)
----  ---------  ---  ---------------------------
1     2          ...  {"foo": 1, "bar": 1}
2     2          ...  {"foo": 2}
This also pretty much exactly matches the intended SUPER use case described by the AWS docs:
When you need to store a relatively small set of key-value pairs, you might save space by storing the data in JSON format. Because JSON strings can be stored in a single column, using JSON might be more efficient than storing your data in tabular format. For example, suppose you have a sparse table, where you need to have many columns to fully represent all possible attributes, but most of the column values are NULL for any given row or any given column. By using JSON for storage, you might be able to store the data for a row in key:value pairs in a single JSON string and eliminate the sparsely-populated table columns.
So this is the approach I've been trying to implement. But I haven't quite been able to achieve what I'm hoping to, mostly due to difficulties populating and aggregating the SUPER column. These are described below:
Questions
Q1:
How can I insert into this kind of SUPER column from another SELECT query? All Redshift docs only really discuss SUPER columns in the context of initial data load (e.g. by using json_parse), but never discuss the case where this data is generated from another Redshift query. I understand that this is because the preferred approach is to load SUPER data but convert it to columnar data as soon as possible.
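One workaround I can sketch (not an officially documented pattern) is to assemble the JSON text with LISTAGG and parse it back into SUPER with json_parse. This assumes a hypothetical source table events, and that event_type values need no JSON escaping:

insert into my_aggregated_table
select
    "user",
    sum(cnt) as nb_events,
    json_parse(
        '{' || listagg('"' || event_type || '":' || cnt::varchar, ',') || '}'
    ) as count_by_event_type
from (
    -- one row per (user, event_type) with its count
    select "user", event_type, count(*) as cnt
    from events
    group by "user", event_type
) t
group by "user";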
Q2:
How can I re-aggregate this kind of SUPER column, while retaining the SUPER structure? Until now, I've discussed a simplified example which only aggregates by user. In reality, there are other dimensions of aggregation, and some analyses of this table will need to re-aggregate the values shown in the table above. By analogy, the desired output might look something like (aggregating over all users):
nb_events  ...  count_by_event_type (SUPER)
---------  ---  ---------------------------
4          ...  {"foo": 3, "bar": 1}
I can get close to achieving this re-aggregation with a query like the following (where the LISTAGG of key-value string pairs is a stand-in for the SUPER-type construction that I don't know how to do):
select
    sum(nb_events) nb_events,
    (
        select listagg(s)
        from (
            select k::text || ':' || sum(v)::text as s
            from my_aggregated_table inner_query,
                 unpivot inner_query.count_by_event_type as v at k
            group by k
        ) a
    ) count_by_event_type
from my_aggregated_table outer_query
But Redshift doesn't support this kind of correlated query:
[0A000] ERROR: This type of correlated subquery pattern is not supported yet
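One way I can sketch around the correlated pattern is to do the UNPIVOT aggregation in a CTE and cross-join the single-row totals, rebuilding the SUPER value with the same LISTAGG-plus-json_parse trick as above (same hedge about JSON escaping applies):

with per_type as (
    select k::varchar as k, sum(v::int) as v
    from my_aggregated_table t,
         unpivot t.count_by_event_type as v at k
    group by k
),
totals as (
    select sum(nb_events) as nb_events
    from my_aggregated_table
)
select
    t.nb_events,
    json_parse(
        '{' || listagg('"' || p.k || '":' || p.v::varchar, ',') || '}'
    ) as count_by_event_type
from totals t, per_type p
group by t.nb_events;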
Q3:
Are there any alternative approaches to consider? Normally I'd handle this kind of problem with Spark, which I find much more flexible for these kinds of problems. But if possible it would be great to stick with Redshift, since that's where the source data is.

Filemaker summarize related field if

I have a FM13 DB with a table "machines" and its related table "consumption" connected with the IDs.
My layout is showing the machines data and a portal with the related consumption entries.
Now I want to summarize the "amount" fields of all related entries where "fuelType" is "diesel" and the "year" is "2015" into one calculated field within the machines table.
Can anyone give me a clue how to do that?
thx
dan
In addition to the ways Michael suggested, FileMaker 12 introduced the ExecuteSQL() function, which can be used in a calculated field. The calculation would look something like this:
ExecuteSQL (
    "SELECT SUM(amount)" & ¶ &
    "FROM consumption" & ¶ &
    "WHERE FuelType = 'diesel' AND \"Year\" = 2015" ;
    "" ; ""
)
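A variant of the same call, as a sketch, that passes the fuel type and year as ExecuteSQL arguments instead of hard-coding them (Machines::gFuelType and Machines::gYear are the global fields suggested in the answer below):

ExecuteSQL (
    "SELECT SUM(amount)" & ¶ &
    "FROM consumption" & ¶ &
    "WHERE FuelType = ? AND \"Year\" = ?" ;
    "" ; "" ;
    Machines::gFuelType ; Machines::gYear
)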
Alternatively, create another relationship based on a calculation field in both tables that concatenates the key fields:
id_fueltype_year: 12345_diesel_2015
You can then create a calculation field in Machines containing Sum ( NewRelation::Amount ).
If you hard-code the fuel type in the calc field on the Machines side, the result will always be for the specified fuel type (e.g. diesel), so you would need to create a separate field for each fuel type. Alternatively, create a fuel type field; as you change its value, the summary calc field updates with the selection.
If this is for display only, you could simply create a one-row portal based on the existing relationship and filter it to show only records where:
Consumption::FuelType = "diesel" and Consumption::Year = 2015
or (preferably, IMHO):
Consumption::FuelType = Machines::gFuelType
and
Consumption::Year = Machines::gYear
with Machines::gFuelType and Machines::gYear being global fields where you can select any type/year to summarize.
Place a summary field defined (in the Consumption table) as Total of [Amount] inside the filtered portal.
If you need the result as data for further processing, then you will need to add a dedicated relationship (using another occurrence of the Consumption table) as:
Machines::MachineID = Consumption 2::MachineID
AND
Machines::gFuelType = Consumption 2::FuelType
AND
Machines::gYear = Consumption 2::Year
and use Sum ( Consumption 2::Amount ) to summarize the relevant entries.

ETL Process when and how to add in Foreign Keys T-SQL SSIS

I am in the early stages of creating a Data Warehouse based loosely on the Kimball methodology.
I am currently investigating my source data. I understand that adding a surrogate primary key (rather than using the natural key) is what will allow me to make the connections between the facts and dimensions.
Sounds like a silly question but how exactly is this done? Are there any good articles that run through this process?
I would imagine we bring in all of the dimensions first, and when the fact data is brought over, a lookup is performed that "pushes" the foreign key into the fact table? At what point is this done? Within SSIS, what is the "best practice" method? Is this all done in one package, for example?
Is that roughly how it happens?
In this case do we have to be particularly careful in what order we load our data, or we could be loading facts for which there is no corresponding dimension?
I would imagine we bring in all of the Dimensions first. And when the fact data is brought over a lookup is performed that "pushes" the Foreign key into the Fact table? At what point is this done? Within SSIS what is the "best practice" method? Is this all done in one package for example?
It would depend on your schema and table design.
Assuming it's a star schema and the FK is based on the data value itself:

DIM1 <- FACT1 -> DIM2
  ^
  |
FACT2 -> DIM3
you'll first fill DIM1 and DIM2 before inserting into FACT1 as you would need the FK.
Assuming it's a snowflake schema:

DIM1_1
  ^
  |
DIM1 <- FACT1 -> DIM2
you'll first fill DIM1_1 then DIM1 and DIM2 before inserting into FACT1.
Assuming the FK relation is based on something else (usually a number) instead of the data value itself (an optimization when dealing with huge amounts of data and/or strings as dimension values), you won't need to wait until the data is inserted into the DIM table. I'm sure that sounds very confusing :), so I'll try to explain it briefly. The steps involved would be something like this (assume a simple star schema with two tables, FACT1 and DIMENSION1):
Extract FACT and DIMENSION values from the data set you are processing.
Generate a unique number from the DIMENSION's value (which, say, is a string), using a reproducible algorithm (e.g. SHA-1: given the same string, it always produces the same number).
Insert into FACT1 table, the number and FACT values.
Insert into DIMENSION1 table, the number and DIMENSION values.
Steps 3 & 4 can be done in parallel. as long as there is NO constraint in place. A join on a numeric column would be more efficient than one of a string.
And there is no need to store the mapping for #2 because it's reproducible (just ensure you pick the right algo).
Obviously this can be extended for snowflake schema and/or multiple dimensions.
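A minimal T-SQL sketch of steps 2-4 (every table and column name here is made up; HASHBYTES is SQL Server's built-in hashing function):

-- Derive the surrogate key from the dimension value itself, so the
-- fact and dimension loads can run in parallel without a lookup.
-- Keeping 8 of SHA-1's 20 bytes trades collision risk for key size.
INSERT INTO dbo.FACT1 (DimKey, FactValue)
SELECT CONVERT(BIGINT, SUBSTRING(HASHBYTES('SHA1', s.DimValue), 1, 8)),
       s.FactValue
FROM staging.SourceData AS s;

INSERT INTO dbo.DIMENSION1 (DimKey, DimValue)
SELECT DISTINCT
       CONVERT(BIGINT, SUBSTRING(HASHBYTES('SHA1', s.DimValue), 1, 8)),
       s.DimValue
FROM staging.SourceData AS s;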
HTH

simplest example of a query by date range in cassandra 1.x

I want to store an ID and a date, and I want to retrieve all entries from dateA up to dateB. What exactly do I need in order to be able to perform select from my_column_family where date >= dateA and date < dateB; ?
The guys at #cassandra (IRC) helped me find a way; there are many subtle details, so I'd like to document them here.
First you need to declare a column family similar to this (examples from cassandra-cli):
create column family users with comparator=UTF8Type and key_validation_class=UTF8Type and column_metadata=[
    {column_name: id, validation_class: LongType},
    {column_name: name, validation_class: UTF8Type, index_type: KEYS},
    {column_name: age, validation_class: LongType}
];
A few important things about this declaration:
the comparator and key_validation_class are there to be able to use strings as key names
the first declared column is special: it's the "row key", which is used to address each row and therefore cannot contain duplicate values (the INSERT is really an UPSERT, so when there are duplicates the new values overwrite the old ones)
the second column declares a "secondary index" on its values (more on that below)
the dates are stored as Long datatypes; interpretation is up to the client
Now let's add some values:
set users[1][name] = john;
set users[1][age] = 19;
set users[2][name] = jane;
set users[2][age] = 21;
set users[3][name] = john;
set users[3][age] = 32;
According to http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/, Cassandra does not support the < operators on their own; what it does is manually exclude the rows that don't match, but it does that AFTER there is a result set, and it also refuses to do so unless an actual equality filter has taken place.
What that means is that a query like get users where age > 20; will return null, but if we add a predicate that includes = it will magically work.
Here's where the secondary index is important: without it you can't use =, so in this example I can do get users where name = jane; but I cannot ask for get users where age = 21;
The funny thing is that after using =, the < comparisons work, so having a secondary index allows you to ask for get users where name = john and age > 20; and it will filter correctly.
There are a few ways to solve this. The simplest is probably the secondary-index solution with the equality limitation mentioned in your own answer. I've used this method, adding an additional column called 'valid' and setting its value to 1. The queries can then become: where valid = 1 and date > nnnn.
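In cli terms that might look like the following sketch (the valid and date columns and their metadata are assumptions layered onto the question's schema; update column family replaces the whole column_metadata list, so the original columns are repeated):

update column family users with column_metadata=[
    {column_name: id, validation_class: LongType},
    {column_name: name, validation_class: UTF8Type, index_type: KEYS},
    {column_name: age, validation_class: LongType},
    {column_name: valid, validation_class: LongType, index_type: KEYS},
    {column_name: date, validation_class: LongType}
];
set users[1][valid] = 1;
set users[1][date] = 1262304000;
get users where valid = 1 and date > 1262304000;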
The other solutions require additional column families and additional queries.
When loading the data, create and add to a column family that uses the timestamps as row keys, where each row lists the matching user ids as column names.
If the partitioning strategy is ordered, then a single RangeSliceQuery can specify the date range as a key range and get all the columns for each key. Then iterate through the result keys, collecting the user ids from each row's columns and, if needed, querying the original column family for the data associated with each id. Cassandra always stores column names sorted, and they can be read in reverse order.
But, as documented, the ordered partitioner is not ideal, leading to hot spots and difficulty in load balancing the nodes.
Without the ordered partitioner, still keeping the timestamp column family, you would have to create another column family while loading data where you can store all the timestamps as the columns under one or more known keys (e.g. 'created' or 'updated'). The first query would be a SliceQuery for a known key, and then the column names (as timestamps) would provide the keys for the MultigetSliceQuery to the timestamp column family.
I've used variations on this, usually adding Composite keys or columns for additional flexibility.