I have an events table that includes a column for IP address. I also have a mapping table that takes 0-padded IP address ranges and has a corresponding city, country.
I can write the query that converts an IP address to a 0-padded IP address, and then joins against my mapping table.
But performance-wise, I can't get anything to work. The events table is 40m+ rows, so trying a join based on a field manipulation takes 30 minutes before I pull the plug. I tried mapping a mapping table (IP to 0-padded IP) but it did not improve things. Even building a reduced table of unique IP addresses with city is taking forever.
Is there an approach or strategy I should take here to try to get merge these datasets more efficiently?
You don't have to do this on the fly every time. A good approach is to materialize. Especially since your data is likely immutable (IP is not changed once it's recorded for whatever data is in the row), you have more options. They can be:
run a script every night to rerun the whole join
run incremental appends, so every new portion of data, append the join for every new portion of data that comes into the source table
If you can afford storing another copy of 40m+ table you can just store the result as a new table that has all columns from the source table plus county (and maybe 0-padded IP), if not you can store a table with some ID from the source table plus join results that provides a faster join than the original query.
Or maybe if the source table is a result of some ETL/ELT process that can just be another step.
Related
I'm writing a kind of summary page for my FileMaker solution.
For this, I have define a "statistics" table, which uses formula fields with ExecuteSQL to gather info from most tables, such as number of records, recently changed records, etc.
This strangely takes a long time - around 10 seconds when I have a total of about 20k records in about 10 tables. The same SQL on any database system shouldn't take more than some fractions of a second.
What could the reason be, what can I do about it and where can I start debugging to figure out what's causing all this time?
The actual code is, like this:
SQLAusführen ( "SELECT COUNT(*) FROM " & _Stats::Table ; "" ; "" )
SQLAusführen ( "SELECT SUM(\"some_field_name\") FROM " & _Stats::Table ; "" ; "" )
Where "_Stats" is my statistics table, and it has a string field "Table" where I store the name of the other tables.
So each row in this _Stats table should have the stats for the table named in the "Table" field.
Update: I'm not using FileMaker server, this is a standalone client application.
We can definitely talk about why it may be slow. Usually this has mostly to do with the size and complexity of your schema. That is "usually", as you have found.
Can you instead use the DDR ( database design report ) instead? Much will depend on what you are actually doing with this data. Tools like FMPerception also will give you many of the stats you are looking for. Again, depends on what you are doing with it.
Also, can you post your actual calculation? Is the statistic table using unstored calculations? Is the statistics table related to any of the other tables? These are a couple things that will affect how ExecuteSQL performs.
One thing to keep in mind, whether ExecuteSQL, a Perform Find, or relationship, it's all the same basic query under-the-hood. So if it would be slow doing it one way, it's going to likely be slow with any other directly related approach.
Taking these one at a time:
All records count.
Placing an unstored calc in the target table allows you to get the count of the records through the relationship, without triggering a transfer of all records to the client. You can get the value from the first record in the relationship. Super light way to get that info vs using Count which requires FileMaker to touch every record on the other side.
Sum of Records Matching a Value.
using a field on the _Stats table with a relationship to the target table will reduce how much work FileMaker has to do to give you an answer.
Then having a Summary field in the target table so sum the records may prove to be more efficient than using an aggregate function. The summary field will also only sum the records that match the relationship. ( just don't show that field on any of your layouts if you don't need it )
ExecuteSQL is fastest when it can just rely on a simple index lookup. Once you get outside of that, it's primarily about testing to find the sweet-spot. Typically, I will use ExecuteSQL for retrieving either a JSON object from a user table, or verifying a single field value. Once you get into sorting and aggregate functions, you step outside of the optimizations of the function.
Also note, if you have an open record ( that means you as the current user ), FileMaker Server doesn't know what data you have on the client side, and so it sends ALL of the records. That's why I asked if you were using unstored calcs with ExecuteSQL. It can seem slow when you can't control when the calculations fire. Often I will put the updating of that data into a scheduled script.
I am asking for help on the following topic. I am trying to create an ETL process with two Excel data sources (S1 ~300 rows and S2 ~7000 rows). S1 contains project information and employee details and S2 contains the amount of hours, which each employee worked in which project at a timestamp.
I want to insert the amount of hours, which each employee worked in each project at a timestamp, into the fact table by referencing to the existing primary keys in the dimension tables. If an entry is not present in the dimension tables already, i want to add a new entry first and use the newly generated id. The destination table structure looks as follows (Data Warehouse, Star Schema):Destination Table Structure
In SSIS, i created three Data Flow tasks for filling the Dimension Tables (project, employee and time) with distinct values (using group by, as S1 and S2 contain a lot of duplicate rows)first, and a fourth data flow task (see image below) to insert the FactTable data, and this is where I'm running into problems:
Data Flow Task FactTable
I am using three LookUp functions to retrieve the foreignKeys project_id, employee_id and time_id from the Dimension tables (using project name, employee number and timestamp). If the id is found, it is passed on all the way to Merge Join 1, if not, a new Dimension Entry is created (lets say project) and the generated project_id passed on instead. Same goes for employee and time respectively.
There is two issues with this:
1) The "amount of hours" (passed by Multicast four, see image above) is not matched in the final result (No Match)
2) The amount of rows being inserted keeps increasing forever (Endless Join, I belive due to the Merge joins).
What I've tried:
I have used one UNION instead of three Merge Joins before, but this resulted in the foreign keys being in seperate rows each, instead of merged together.
I used Merge (instead of Merge Join) and combined the join as well as sort conditions in as I fell all possible ways.
I understand that this scenario might be confusing for everybody else, but thank your for taking time looking at it! Any help is greatly appreciated.
Solved it
For anybody having similar issues:
Seperate Data Flows for filling Dimension Tables with those filling Fact Tables will do the trick.
Its a clean solution and easier to debug.
Also: Dont run the LookUp Functions in parallel, but rather one after each other and pass on the attributes. Saves unnecessary Merges as well.
So as a Sum Up:
Four Data Flow Tasks, three for filling dimension tables ONLY and one for filling fact tables ONLY.
Loading Multiple Tables using SSIS keeping foreign key relationships
The answer posted by onupdatecascade is basically it.
Good luck!
I have a talend job that i require a lookup at the target table.
Naturally the target table is large (a fact table) so I don't want to have to wait to load the whole thing before going to running lookups like this picture below:
Is there a way to have the lookup work DURING the pull from the main source?
The attempt is to speed up the inital loads so things move fast, and attempt to save on memory. as you can see, the lookup is already passed 3 Million rows.
the tLogRow represents the same table as the lookup.
You can achieve what you're looking for by configuring the lookup in your tMap to use "Reload at each row" lookup model, instead of "Load Once". This lookup model allows you to reexecute your lookup query for each incoming row, instead of loading all your lookup table at once, useful for lookups on large tables.
When you select the reload at each row model, you will have to specify a lookup key in the global map sections that will appear under the settings. Create a key with a name like "ORDER_ID", and map it with FromExt.ORDER_ID column. Then modify your lookup query so that it returns a single match for the ORDER_ID like so:
"SELECT col1, col1.. FROM lookup_table WHERE id = '" + (String)globalMap.get("ORDER_ID") + "'".
This is supposing your id column is a string.
What this does is create a global variable called "ORDER_ID" containing the order id for every incoming row from your main connection, then executes the lookup query filtering for that id.
I have a redshift cluster with a single dc1.large node. I've got data writing into it, on order of 50 million records a day, in the format of a timestamp, a user ID and an item ID. The item ID (varchar) is unique, the user ID (varchar) is not, and the timestamp (timestamp) is not.
In my redshift DB of about 110m records, if I have a table with no sort key, it takes about 30 seconds to search for a single item ID.
If I have a table with a sort key on item ID, I get a single item ID search time of about 14-16 seconds.
If I have a table with an interleved sort key with all three columns, the single item ID search time is still 14-16 seconds.
What I'm hoping to achieve is the ability to query for the records of thousands or tens of thousands of item IDs on order of a second.
The query just looks like
select count(*) from rs_table where itemid = 'id123';
or
select count(*) from rs_table where itemid in ('id123','id124','id125');
This query comes back in 541ms
select count(*) from rs_table;
AWS documentation suggests that there is a compile time for queries the first time they're run, but I don't think that's what I'm seeing (and it would be not ideal if it was, since each unique set of 10,000 IDs might never be queried in exactly the same order again.
I have to assume I'm doing something wrong with either the sort key design, the query, or some combination of the two - for only ~10g of table space, something like redshift shouldn't take this long to query, right?
Josh,
We probably need a few additional pieces of information to give you a good recommendation.
Here are some things to start thinking about.
Are most of your queries record lookups as you describe above?
What is your distribution key?
Do you join this table with other large fact tables?
If you load 50M records per day and you only have 110M records in the
table, does that mean that you only store 2 days?
Do you do massive deletes and then load another 50M records per day?
Do you run ANALYZE after your loads?
If you deleted a large number of records, did you run VACUUM?
If all of your queries are similar to the ones that you describe, why are you using Redshift? Amazon DynamoDB or MongoDB (even Cassandra) would be great database choices for the types of queries that you describe.
If you run analytical workloads Redshift is an excellent platform. If you are more interested in "record lookups" the NoSQL options, as well as mysql or MariaDB might give you better performance.
Also, if this is a dev/test environment and you have loaded and deleted large amounts of data without ever running a VACUUM you would see significant performance degradation.
I am moving from mysql to hbase due to increasing data.
I am designing rowkey for efficient access pattern.
I want to achieve 3 goals.
Get all results of email address
Get all results of email address + item_type
Get all results of particular email address + item_id
I have 4 attributes to choose from
user email
reverse timestamp
item_type
item_id
What should my rowkey look like to get rows efficiently?
Thanks
Assuming your main access is by email you can have your main table key as
email + reverse time + item_id (assuming item_id gives you uniqueness)
You can have an additional "index" table with email+item_type+reverse time+item_id and email+item_id as keys that maps to the first table (so retrieving by these is a two step process)
Maybe you are already headed in the right direction as far as concatenated row keys: in any case following comes to mind from your post:
Partitioning key likely consists of your reverse timestamp plus the most frequently queried natural key - would that be the email? Let us suppose so: then choose to make the prefix based on which of the two (reverse timestamp vs email) provides most balanced / non-skewed distribution of your data. That makes your region servers happier.
Choose based on better balanced distribution of records:
reverse timestamp plus most frequently queried natural key
e.g. reversetimestamp-email
or email-reversetimestamp
In that manner you will avoid hot spotting on your region servers.
.
To obtain good performance on the additional (secondary ) indexes, that is not "baked into" hbase yet: they have a design doc for it (look under SecondaryIndexing in the wiki).
But you can build your own a couple of ways:
a) use coprocessor to write the item_type as rowkey to separate tabole with a column containing the original (user_email-reverse timestamp (or vice-versa) fact table rowke
b) if disk space not issue and/or the rows are small, just go ahead and duplicate the entire row in the second (and third for the item-id case) tables.