First, please allow me to explain what I am trying to achieve. I have a found set of about 100 records, and each record has a SKU number (serial). Some of the 100 records share the same SKU, so of course there are fewer than 100 unique SKUs.
I want to know how many times a SKU appears within the found set. NOT the total number of unique SKUS, but more like how many times each SKU appears individually.
So for example I could have the SKU - 123456 - which appears twice in the found set, so the value for that should be 2, as there are 2 instances of that SKU in the found set.
So just to reiterate, I do not want the total number of unique SKUs in the found set; I want to know how many times each individual SKU appears within the found set.
I have tried many things but keep ending up with the total unique values which is absolutely no use to me.
Thanks
Building on michael.hor257's idea, you can use the GetSummary function to achieve your desired result.
Sample field definition:
After sorting your found set on SKU, this will produce the following result:
Use FileMaker's native ExecuteSQL function to achieve this easily. Here is a FileMaker article that gives very good insight:
http://help.filemaker.com/app/answers/detail/a_id/3423/~/counting-the-number-of-unique-values-in-a-field
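To make the target concrete, here is a minimal Python sketch (not FileMaker code) of the result being asked for: one count per SKU, rather than a count of distinct SKUs. The sample SKUs other than 123456 are made up for illustration, and `count_per_sku` is a hypothetical helper name.

```python
from collections import Counter

def count_per_sku(skus):
    """Return a dict mapping each SKU to the number of times it appears."""
    return dict(Counter(skus))

# Hypothetical found set: one SKU value per record.
found_set = ["123456", "777001", "123456", "888002"]
counts = count_per_sku(found_set)

# One line per SKU with its number of occurrences in the found set.
for sku, n in sorted(counts.items()):
    print(sku, n)
```

Here 123456 gets a count of 2 because it appears on two records, which is the per-SKU value the question describes.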
I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is this possible to do this with Dataprep? I have reviewed the documentation of sampling, but that only applies to the input of the Transformer tool and not the outcome of the process.
You can do this, but it takes a bit of extra work in your recipe: set up a formula in a new column using something like RANDBETWEEN to give you a random integer between 1 and 1,000 (in this million-out-of-a-billion case). From there, filter the rows so that you keep only those matching one chosen integer between 1 and 1,000, and your output will contain only that randomized subset. Just have the last step of the recipe remove the temporary column.
So indeed there are 2 approaches to this.
As Courtney Grimes said, you can use one of the 2 functions that generate a random number within a range.
randbetween :
rand :
These methods can be used to slice an "even" portion of your data. As suggested, add a randbetween(1,1000) column, then pick a single value x (1 <= x <= 1000) and keep only the rows where the column equals x; that gives you 1/1000 of the data (a million out of a billion).
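As a rough illustration of the random-bucket idea, here is a small Python sketch (not Dataprep recipe syntax). It assigns each row a random integer from 1 to 1,000 and keeps only rows that land in one chosen bucket. The function name, the `seed` parameter, and the scaled-down row count are assumptions made for the example.

```python
import random

def sample_rows(rows, buckets=1000, keep=1, seed=None):
    """Keep roughly 1/buckets of the rows by random bucket assignment."""
    rng = random.Random(seed)
    kept = []
    for row in rows:
        # randint is inclusive on both ends, like a 1..1000 random integer column.
        bucket = rng.randint(1, buckets)
        if bucket == keep:
            kept.append(row)
    return kept

# Scaled-down stand-in for the billion-row input.
rows = range(200_000)
subset = sample_rows(rows, seed=42)
print(len(subset))  # close to 200, but not exactly: the sample size is random
```

Note that the output size is only approximately 1/1000 of the input, so this suits prototyping where an exact row count is not required.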
Alternatively, if you just want a million records in your output, but either
don't want to rely on knowing the size of the entire table, or
just want the first million rows, regardless of how many rows there are,
you can use 2 of these 3 row filtering methods (top rows or range):
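For comparison, the "top rows" and "range" ideas can be sketched in Python like this (again, not Dataprep syntax; `top_rows` and `row_range` are hypothetical names, and the 1-based row positions are an assumption):

```python
from itertools import islice

def top_rows(rows, n):
    """Keep the first n rows without needing to know the total row count."""
    return list(islice(rows, n))

def row_range(rows, start, stop):
    """Keep rows whose 1-based position is between start and stop, inclusive."""
    return list(islice(rows, start - 1, stop))

first_three = top_rows(iter(range(10)), 3)
middle = row_range(range(10), 2, 4)
print(first_three, middle)
```

Both work on a stream of rows, so neither depends on the size of the full table.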
P.S
By understanding the $sourcerownumber metadata parameter (see the in-product documentation), you can filter/keep a portion of the data (as per the first scenario) in 1 step (i.e. without creating an additional column).
BTW, an easy way to discover how-tos in Trifacta is to type what you're looking for in the "search-transformation" pane (accessed via ctrl-k). By searching "filter", you'll get most of the relevant options for your problem.
Cheers!
I have volume data for specific customers. The customer names come from Salesforce and the volume comes from another table. When I add each in Tableau, I get a nice table that seems to be working.
We can see that there are 19 values of ~500. My ultimate goal is to sum these based upon filters.
One way I discovered to do that is to use the syntax
{ FIXED [Account Id]: count([Volume]) }
But when I do that,
I get
When I change my function to count([volume]) I get a count of all joined rows, ~250k.
My question is: how do I make this respect individual entries in the database and not all the joined values? If there were a way to do the sum for distinct timestamps in another field, that would also work. Any other advice from you Tableau experts would be helpful.
Thanks!
I think I got it. In the database table that I was trying to calculate from, there were 20 rows that needed to be summed. When the data was joined with the Salesforce data, it duplicated those rows. The trick here was to take the sum of the max for each primary key:
SUM({ FIXED [Pk], [Name1] : MAX([Volume]) })
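To show why sum-of-max undoes the join duplication, here is a small Python sketch of the same logic outside Tableau. The sample rows and the helper name `sum_of_max_per_key` are made up for illustration; it assumes every duplicate of a given key carries the same volume.

```python
def sum_of_max_per_key(rows):
    """Take the max volume per (pk, name) key, then sum across keys."""
    max_by_key = {}
    for pk, name, volume in rows:
        key = (pk, name)
        if key not in max_by_key or volume > max_by_key[key]:
            max_by_key[key] = volume
    return sum(max_by_key.values())

# Hypothetical joined data: the join repeated each source row several times.
joined = [
    (1, "Acme", 500), (1, "Acme", 500), (1, "Acme", 500),
    (2, "Globex", 480), (2, "Globex", 480),
]

naive_total = sum(volume for _, _, volume in joined)  # counts duplicates
deduped_total = sum_of_max_per_key(joined)            # one value per key
print(naive_total, deduped_total)
```

The naive sum is inflated by the duplicated rows, while the per-key max collapses each group back to a single value before summing.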
I will try to explain the problem on an abstract level first:
I have X amount of data as input, which is always going to have a field DATE. Before, the dates that came as input were (after some processing) put in a table as output. Now, I am asked to output both the input dates and every date between the minimum date received and one year from that moment. If there was originally no input for some day between these two dates, all fields must come with 0, or equivalent.
Example. I have two inputs. One with '18/03/2017' and other with '18/03/2018'. I now need to create output data for all the missing dates between '18/03/2017' and '18/04/2017'. So, output '19/03/2017' with every field to 0, and the same for the 20th and 21st and so on.
I know how to do this programmatically, but not in PowerCenter. I've been told to do the following (which I have done, but I would like to know of a better method):
Get the minimum date, day0. Then, with an aggregator, create 365 fields, each holding day0+1, day0+2, and so on, to create an artificial year.
After that we do several transformations, like sorting the dates and a union between them, to get the data ready for a joiner. The idea of the joiner is to do a Full Outer Join between the original data and the data that is going to have all fields set to 0 and that we got from the previous aggregator.
Then a router picks with one of its groups the data that had actual dates (and fields without nulls) and with another group the data where all fields are null; those fields are then given a 0 to finally be written to a table.
I am wondering how this can be achieved by, for starters, removing the need to add 365 days to a date. If I were to do this same process for 10 years instead of one, the task gets ridiculous really quickly.
I was wondering about an XOR type of operation, or some other function that would cut the number of steps that need to be done for what I (maybe wrongly) feel is a simple task. Currently I need 5 steps just to know which dates are missing between two dates, a minimum and one year from that point.
I have tried to be as clear as possible, but if I failed at any point please let me know!
I'm not sure what the aggregator is supposed to do?
The same goes for the 'full outer' join. A normal join on a constant port is fine :)
Can you calculate the needed number of 'duplicates' before the 'joiner'? In that case a lookup configured to return 'all rows' with a less-than-or-equal predicate can help make the mapping much more readable.
In any case you will need a helper table (or file) with a sequence of numbers between 1 and the number of potential duplicates (or more).
I use the time dimension in our warehouse, which has one row per day from 1753-01-01 through the next 200000 days, and a primary integer column with values from 1 and up ...
You've identified that you know how to do this programmatically, and to be fair this problem is more suited to that sort of solution... but that doesn't exclude PowerCenter by any means. Just feed the 2 dates into a Java transformation and apply some code to produce all dates between them, outputting a record for each. The Java transformation is ideal for record generation.
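The record-generation logic itself is short. Here is a minimal sketch of it in Python rather than in an actual Java transformation, assuming a single numeric field per date and a window of the minimum date plus 365 days; `fill_missing_dates` and the sample values 7 and 9 are hypothetical.

```python
from datetime import date, timedelta

def fill_missing_dates(values_by_date, days=365):
    """Return (date, value) for every day from the minimum input date
    through minimum + days, using 0 where no input exists for a day."""
    start = min(values_by_date)
    filled = []
    for offset in range(days + 1):
        day = start + timedelta(days=offset)
        filled.append((day, values_by_date.get(day, 0)))
    return filled

# The two input dates from the question, with made-up field values.
inputs = {date(2017, 3, 18): 7, date(2018, 3, 18): 9}
output = fill_missing_dates(inputs)
print(output[0], output[1], output[-1])
```

Extending the window to 10 years only means changing `days`, which is the part that gets unwieldy with one aggregator port per day.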
OK... so you could override your source qualifier to achieve this in the selection query itself (I'm giving an Oracle-based example as it's what I'm used to, and I'm assuming your input data comes from a table). I looked up the CONNECT BY syntax here:
SQL to generate a list of numbers from 1 to 100
SELECT (mindate.d + levquery.n - 1) AS Port1 FROM (SELECT MIN(DATEFIELD) AS d FROM tablea) mindate, (SELECT LEVEL n FROM DUAL CONNECT BY LEVEL <= 365) levquery
(Check if the query works for you - haven't access to pc to test it at the minute)
I'm working in Pentaho 4.4.1-GA (Kettle / PDI). The database is Postgres.
I need to be able to insert multiple records into a fact table based on the fields that come from a single record. The single record contains fields:
productcode1, price1
productcode2, price2
productcode3, price3
...
productcode10,price10
So if there was a value for each of the 10 productcode / prices then I'd need to insert a total of 10 records into the fact table. If there were values for 4 of the combinations, then I'd need to insert 4 records into the fact table, etcetera. All field values for the fact records would be identical except for the PK (generated by sequence), product codes, and prices.
I figure that I need some type of looping construct which would let me check whether or not a value was present for each productx field, and if so, do an insert/update step on the fact table with the desired field values. I'm just not sure how to do this in Pentaho.
Any ideas? All suggestions are welcome :)
Thank You,
Rakesh
Could you give a sample input and output for your scenario?
From your example data I can infer that if there are 10 different product codes and only 4 product prices you want to have 4 records inserted into your table. Is that so?
Well, for a start you can add a constant value of 1 to those records by filtering for NOT NULL and then use a Group By step to count the number of 1's. This would give you the count. BTW, it would be helpful if you could provide more details on which columns you would be loading, as there are ways to make a PDI transformation execute multiple times.
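For reference, the row-splitting asked about in the question is an "unpivot" rather than a loop. Here is a minimal Python sketch of that logic (not a PDI transformation; in PDI the Row Normaliser step is, as far as I know, the usual tool for this kind of thing). It assumes a slot only counts when both its product code and price are present; `unpivot_products` and the `order_id` field are made up for the example.

```python
def unpivot_products(record, slots=10):
    """Turn one wide record into one row per populated productcode/price pair."""
    # Fields that are not productcodeN / priceN are copied to every output row.
    shared = {
        k: v for k, v in record.items()
        if not k.startswith(("productcode", "price"))
    }
    rows = []
    for i in range(1, slots + 1):
        code = record.get(f"productcode{i}")
        price = record.get(f"price{i}")
        if code is None or price is None:
            continue  # assumption: skip slots that are not fully populated
        rows.append({**shared, "productcode": code, "price": price})
    return rows

# Hypothetical input record with 2 of the 10 slots filled in.
record = {
    "order_id": 42,
    "productcode1": "A1", "price1": 9.5,
    "productcode2": None, "price2": None,
    "productcode3": "C3", "price3": 20.0,
}
fact_rows = unpivot_products(record)
print(fact_rows)
```

Each resulting row could then go to an insert step, with the PK coming from the sequence as described.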
I have the following issue: I have a report that uses a Dataset as its datasource. The dataset has two tables, one would be the main table, say Employee, and the second table is EmployeePaycheck, so an employee can have several paychecks. I can compute the sum of a column in the second table, say paycheckValue, but what I can't seem to do is also add to this computed field the value of some additional fields in the Employee table, such as ChristmasBonus or YearlyBonus, to see how much the employee was paid at the end of the year.
Without knowing more information on this it will be difficult to answer, but I'll give you a couple things to look for.
First, I would make sure that the fields are of a similar type that will allow for a summary. For example, if one is a string then a summary can't be done without casting or converting the value to a number. I'm assuming that the fields are probably number or decimal columns, so that is probably not the case.
I'd also check to make sure that none of the values that you are trying to sum are null. I haven't tested this, but I believe that it will not sum correctly if one of the rows has a null value. In this scenario you should just be able to use a formula field to check for the null and if the field is null return 0 instead. Then you can use the formula field in your calculations instead of the field itself.
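The null check amounts to the following, sketched here in Python rather than as a report formula field. The field names come from the question, while `zero_if_null`, `total_paid`, and the sample values are hypothetical.

```python
def zero_if_null(value):
    """Stand-in for the formula field: return 0 when the value is null."""
    return 0 if value is None else value

def total_paid(employee, paychecks):
    """Sum the paychecks, then add the employee-level bonus fields."""
    paycheck_sum = sum(zero_if_null(p["paycheckValue"]) for p in paychecks)
    return (
        paycheck_sum
        + zero_if_null(employee.get("ChristmasBonus"))
        + zero_if_null(employee.get("YearlyBonus"))
    )

# Made-up data: one null paycheck value and one null bonus.
employee = {"ChristmasBonus": 500, "YearlyBonus": None}
paychecks = [{"paycheckValue": 1000}, {"paycheckValue": None}, {"paycheckValue": 1200}]
print(total_paid(employee, paychecks))
```

Without the null handling, adding a null to the running total would fail or produce a null result, depending on the tool.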
If neither of these is the case, please provide a little more info on how you are computing the fields and what happens when you do.
Hope this helps.