I have 8 different tables with 24 million to 40 million records each. One of these tables is the master table that is used to join to the other 7.
My question is: when working with such large data sets, is it viable to use a hash merge? I tried a hashing technique I learnt online, but my system ran out of memory while loading the master table itself.
Are there any other efficient methods for merging large data sets in SAS?
Also, could anyone please help me with a snippet for merging these tables? They're all merged with the master table based on different attributes.
Note: there are many-to-one merges in each scenario.
Create indexes on these data sets.
Or divide the master table into smaller pieces, run the PROC SQL join for each piece, then union the results.
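The piecewise PROC SQL approach might look something like the sketch below (librefs, data set names, key columns, and slice boundaries are all made up for illustration; substitute your own):

proc sql;
    /* join one slice of the master to one detail table; repeat per slice and per detail table */
    create table work.merged_slice1 as
    select m.*, d.attr_a
    from big.master as m
    inner join big.detail1 as d
        on m.key1 = d.key1
    where m.key1 between 1 and 5000000;
quit;

proc sql;
    /* stack the slices back together once they are all built */
    create table work.merged_all as
    select * from work.merged_slice1
    union all
    select * from work.merged_slice2 /* ...and so on for the remaining slices */;
quit;

Indexes on the join keys (the first suggestion above) also give the SQL optimizer the option of an index join instead of building everything in memory.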
I have a large number of tables stored in memory in KDB. I am hoping to create an HDB of these tables so I can free up memory. I am a bit confused about the process of creating an HDB - splaying tables, etc. Can someone help me with the process of creating an HDB, and then what needs to be done moving forward - i.e. how to upload whatever new data I have at the end of the day.
Thanks.
There are many ways to create an HDB depending on the scenario. General practices are:
For small tables, just write them as flat/serialised files using
`:/path/to/dbroot/flat set inMemTable;
or
`:/path/to/dbroot/flat upsert inMemTable;
The latter will add new rows while the former overwrites. However, since you're trying to free up memory, flat/serialised files won't be all that useful, since they get pulled into memory in full anyway.
For larger tables (tens of millions of rows) that aren't growing too much on a daily basis, you can splay them using set along with .Q.en (enumeration is required when the table is not saved flat/serialised):
`:/path/to/dbroot/splay/ set .Q.en[`:/path/to/dbroot] inMemTable;
or
`:/path/to/dbroot/splay/ upsert .Q.en[`:/path/to/dbroot] inMemTable;
again depending on whether you want to overwrite or add new rows.
For tables that grow on a daily basis and have a natural date separation, you would write them as date-partitioned tables. While you can also use set and .Q.en for date-partitioned tables (since they are the same as splayed tables, just separated into physical date directories), the easier method might be to use .Q.dpft or dsave if you're using a recent version of kdb. These will do a lot of the work for you.
It's up to you then to maintain the tables: ensure the savedowns occur on a regular basis (usually daily), append to tables if necessary, etc.
Q1: What is the maximum number of tables that can be stored in a database?
Q2: What is the maximum number of tables that can be unioned in a view?
Q1: There's no explicit limit in the docs. In practice, some operations are O(n) in the number of tables; expect planning times to increase, and problems with things like autovacuum, as you get to many thousands or tens of thousands of tables in a database.
Q2: It depends on the query. Generally, huge unions are a bad idea. Table inheritance will work a little better, but if you're using constraint_exclusion it will result in greatly increased planning times.
Both these questions suggest an underlying problem with your design. You shouldn't need massive numbers of tables, and giant unions.
Going by the comment in the other answer, you should really just be creating a few tables. You seem to want to create one table per phone number, which is nonsensical, and to create views per number on top of that. Do not do this, it's mismodelling the data and will make it harder, not easier, to work with. Indexes, where clauses, and joins will allow you to use the data more effectively when it's logically structured into a few tables. I suggest studying basic relational modelling.
If you run into scalability issues later, you can look at partitioning, but you won't need thousands of tables for that.
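To make that concrete, here is a minimal sketch of the kind of schema this answer is pointing at (table and column names are invented for illustration): one table keyed by phone number plus an index, instead of a table or view per number.

CREATE TABLE calls (
    id           bigserial PRIMARY KEY,
    phone_number text        NOT NULL,
    called_at    timestamptz NOT NULL,
    duration_s   integer
);

-- one index replaces thousands of per-number tables and views
CREATE INDEX calls_phone_number_idx ON calls (phone_number, called_at);

-- querying a single number is then just a WHERE clause
SELECT called_at, duration_s
FROM calls
WHERE phone_number = '555-0100'
ORDER BY called_at;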
Both are, in a practical sense, without limit.
The number of tables a database can hold is restricted by the space on your disk system. However, a database with more than a few thousand tables is probably more an expression of an incorrect analysis of your application domain. The same goes for unions: if you have to union more than a handful of tables, you should probably look at your table structure.
One practical scenario where this can happen is with Postgis: having many tables with similar attributes that could be joined in a single view (this is a flaw in the design of Postgis IMHO), but that would typically be handled at the application side (e.g. a GIS).
Can you explain your scenario where you would need a very large number of tables that need to be queried in one sweep?
I have 3 database tables, each containing 6 million rows and gaining about 3 million rows every year.
Following are the table information:
Table 1: 20 fields averaging 50 characters each. Has 2 indexes, both on timestamp fields.
Table 2: 5 fields, including 2 byte array fields and 1 XML field.
Table 3: 4 fields, including 1 byte array field.
Following is the usage:
Insert 15 to 20 records per second in each table.
A view is created by joining the first 2 tables, and the select is mostly based on the date field in the first table.
Right now, inserting one record into each of the three tables together takes about 100 milliseconds.
I'm planning to migrate from Postgres 8.4 to 9.2, and I would like to do some optimization for insert performance as well. I'm also planning to create history tables and move the old records into those tables. I have the following questions in this regard:
Will creating history tables and moving older data into them help increase insert performance?
If it helps, how often do I need to move old records into the history tables: daily, weekly, monthly, or yearly?
If I keep only one month of data (about 220,000 rows) instead of one year (3 million rows), will that help improve insert performance?
Thanks in advance,
Sudheer
I'm sure someone better informed than I will show up and provide a better answer, but my impression is that:
Insert performance is mostly a function of your indexing strategy and your hardware
Performance, in general, is better under 9.0+ than 8.4, and this may rub off on insert performance, but I'm not certain of that.
None of your ideas are going to directly affect insert performance
Now, that said, the cost of maintaining a small index is lower than that of a large one, so it may be that creating history tables and moving old data there will improve performance simply by reducing index pressure. But I would expect dropping one of your indexes to have a direct and greater effect. Perhaps you could have a history table with both indexes and just maintain one of them on the "today" table?
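A rough sketch of what such a move might look like, assuming a hypothetical main table events with a created_at timestamp and a history table events_history with the same columns:

BEGIN;

-- copy rows older than a month into the history table...
INSERT INTO events_history
SELECT * FROM events
WHERE created_at < now() - interval '1 month';

-- ...then remove them from the "today" table so its indexes stay small
DELETE FROM events
WHERE created_at < now() - interval '1 month';

COMMIT;

Whether that actually buys you anything on insert speed is exactly the sort of thing the benchmarking approach below will tell you.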
If I were in your shoes, I'd get a copy of production going on my machine running 8.4 with a similar configuration. Then upgrade to 9.2 and see if the insert performance changes. Then try out these ideas and benchmark them, see which ones improve the situation. It's absolutely essential that things be kept as similar to production as possible for this to yield useful information, but it will certainly be better information than any hypothetical answer you might get.
Now, 100ms seems pretty slow for inserting one row IMO. Better hardware would certainly improve this situation. The usual suggestion would be a big striped RAID array with a battery-backed cache. PostgreSQL 9.0 High Performance has more information on all of this.
I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns hold numerical values that we need to aggregate (avg, stddev, and statistics calculated with R)
a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof of concept, I can see this will scale nicely, but I can also see that Hadoop/HDFS has latency; I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot come back within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of the pooled data AB (all values of A and B together) as sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and E[AB] = (sum(A) + sum(B)) / (|A| + |B|).
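To illustrate the pre-computation idea in plain SQL (table and column names are hypothetical): if each dataset stores per-key partial sums, sums of squares, and counts, then combining any user-selected subset of datasets is a single GROUP BY, with avg and stddev derived from the pooled sums.

-- hypothetical precomputed table, one row per (dataset, key):
--   dataset_summaries(dataset_id, key, col1_sum, col1_sumsq, col1_count)
-- the sum/sumsq columns are assumed to be double precision
SELECT
    key,
    SUM(col1_sum) / SUM(col1_count) AS col1_avg,
    sqrt(
        SUM(col1_sumsq) / SUM(col1_count)
        - power(SUM(col1_sum) / SUM(col1_count), 2)
    ) AS col1_stddev
FROM dataset_summaries
WHERE dataset_id IN (3, 17, 42)   -- whichever datasets the user picked
GROUP BY key;

This is just the population-stddev identity above: sums, sums of squares, and counts combine across datasets, so nothing heavier than addition happens at query time.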
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open-source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open-source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory - that should give you an idea of what size of cluster you need.
If I understand you correctly, and you only need to aggregate on single columns at a time, you can store your data differently for better results.
In HBase that would look something like:
a table per data column in today's setup, and another single table for the filtering fields (type_ids)
a row for each key in today's setup - you may want to think about how to incorporate your filter fields into the row key for efficient filtering, otherwise you'd have to do a two-phase read
a column for each table (dataset) in today's setup (i.e. a few thousand columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant values, from which you can compute the avg etc. quite easily.
You might want to use a plain old database for this. It doesn't sound like you have a transactional system. As a result you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since it doesn't sound like your data set needs joins, you should be fine. You can have indexes set up to find the data sets, and then do the math either in SQL or in the app.
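A sketch of that plain-database layout (all names hypothetical): one wide table of raw values, an index to pull out the chosen datasets quickly, and the math done in the aggregate query or in the app.

-- raw values, one row per (dataset, key)
CREATE TABLE measurements (
    dataset_id integer NOT NULL,
    key        bigint  NOT NULL,
    type_id    integer NOT NULL,
    col1       double precision,
    col2       double precision
);

CREATE INDEX measurements_dataset_idx ON measurements (dataset_id, type_id);

-- aggregate the selected datasets, filtered by type, with the math done in SQL
SELECT key, avg(col1) AS col1_avg, stddev_pop(col1) AS col1_stddev
FROM measurements
WHERE dataset_id IN (3, 17, 42)
  AND type_id = 7
GROUP BY key;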
I'm developing an iPhone app with two SQLite databases. I'm wondering if I should be using only one.
The first database has a dozen small tables (each having four or five INTEGER or TEXT columns). Most of the tables have only a few dozen rows, but one table will have several thousand rows of "Items." The tables are all related so they have many joins between them.
The second database contains only one table. It has a BLOB column with one row related to some of the "Items" from the first database (photos of the items). These BLOBs are only used in a small part of the program, so they are rarely attached or joined with the first database (rows can be fetched easily without joins).
The first database file size will be about a megabyte; the db with the BLOBs will be much larger -- about 50 megabytes. I separated the BLOBs because I wanted to be able to backup the smaller database without having to carry the BLOBs, too. I also thought separating the BLOBs might improve performance for the smaller tables, but I'm not sure.
What are the pros and cons of having two databases (SQLite databases in particular) to separate out the BLOBs, vs. having just one database with the very narrow tables mixed in with the table of BLOBs?
Even at 50 MB, that is a very small database size, and you're not going to see a performance difference due to the size of the database itself.
However, if you think performance is the issue, check your DB queries, and make sure you're not SELECTing the BLOB column in queries where you don't need that information. Returning more data from the database than you need (think of it as reading more data from a hard drive than you need to) is where you're much more likely to see performance issues.
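For example (table and column names are hypothetical), a listing query can leave the BLOB column untouched, and the photo can be fetched separately only when it is actually displayed:

-- listing screen: no BLOB column touched
SELECT item_id, name, price
FROM items
WHERE category_id = 3;

-- detail screen: fetch one photo on demand
SELECT photo
FROM item_photos
WHERE item_id = 42;

The same pattern works whether the photos live in the second database (ATTACHed when needed) or in a BLOB table inside a single database.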
I don't see any pros to separating them, unless you are under a very tight restriction on the total size of the backup. It's just going to add complexity in weird places in your app.