Is Hadoop Suitable For This? - postgresql

We have some Postgres queries that take 6 - 12 hours to complete and are wondering if Hadoop is suited to doing it faster. We have (2) 64 core servers with 256GB of RAM that Hadoop could use.
We're running PostgreSQL 9.2.4. Postgres only uses one core on one server for the query, so I'm wondering if Hadoop could do it roughly 128 times faster, minus overhead.
We have two sets of data, each with millions of rows.
Set One:
id character varying(20),
a_lat double precision,
a_long double precision,
b_lat double precision,
b_long double precision,
line_id character varying(20),
type character varying(4),
freq numeric(10,5)
Set Two:
a_lat double precision,
a_long double precision,
b_lat double precision,
b_long double precision,
type character varying(4),
freq numeric(10,5)
We have indexes on all lat, long, type, and freq fields, using btree. Both tables have "VACUUM ANALYZE" run right before the query.
The Postgres query is:
SELECT
    id
FROM
    setone one
WHERE
    not exists (
        SELECT
            'x'
        FROM
            settwo two
        WHERE
            two.a_lat >= one.a_lat - 0.000278 and
            two.a_lat <= one.a_lat + 0.000278 and
            two.a_long >= one.a_long - 0.000278 and
            two.a_long <= one.a_long + 0.000278 and
            two.b_lat >= one.b_lat - 0.000278 and
            two.b_lat <= one.b_lat + 0.000278 and
            two.b_long >= one.b_long - 0.000278 and
            two.b_long <= one.b_long + 0.000278 and
            (
                two.type = one.type or
                two.type = 'S'
            ) and
            two.freq >= one.freq - 1.0 and
            two.freq <= one.freq + 1.0
    )
ORDER BY
    line_id
Is that the type of thing Hadoop can do? If so can you point me in the right direction?

I think Hadoop is very appropriate for that, but consider using HBase too.
You can run a Hadoop MapReduce routine to fetch the data, process it, and save it to an HBase table in an optimal layout. That way, reading the data back would be much faster.

Try Stado at http://stado.us. Use this branch: https://code.launchpad.net/~sgdg/stado/stado, which will be used for the next release.
Even with 64 cores, you will only be using one core to process that query. With Stado you can create multiple PostgreSQL-based "nodes" even on a single box and leverage parallelism and get those cores working.
In addition, I have had success converting correlated not exists queries into WHERE (SELECT COUNT(*) ...) = 0.
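Applied to the query above, that rewrite would look roughly like this (a sketch only; whether it actually helps depends on the plan you get):
SELECT id
FROM setone one
WHERE (
    SELECT COUNT(*)
    FROM settwo two
    WHERE two.a_lat  BETWEEN one.a_lat  - 0.000278 AND one.a_lat  + 0.000278
      AND two.a_long BETWEEN one.a_long - 0.000278 AND one.a_long + 0.000278
      AND two.b_lat  BETWEEN one.b_lat  - 0.000278 AND one.b_lat  + 0.000278
      AND two.b_long BETWEEN one.b_long - 0.000278 AND one.b_long + 0.000278
      AND (two.type = one.type OR two.type = 'S')
      AND two.freq BETWEEN one.freq - 1.0 AND one.freq + 1.0
) = 0
ORDER BY line_id;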

Pure Hadoop isn't suitable because it doesn't have indexes. An HBase implementation is tricky in this case because only one key is possible per table. In any case, both of them need at least five servers before you see a significant improvement. The best you can do with PostgreSQL is to partition the data by the type column, use the second server as a replica of the first, and execute several queries in parallel, one for each particular type.
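A sketch of that last idea, assuming type takes a small, known set of values: run one statement like the one below per type value ('A' here is a placeholder), each on its own connection or against the replica.
SELECT one.id
FROM setone one
WHERE one.type = 'A'  -- one type value per connection
  AND NOT EXISTS (
      SELECT 1
      FROM settwo two
      WHERE (two.type = one.type OR two.type = 'S')
        AND two.a_lat  BETWEEN one.a_lat  - 0.000278 AND one.a_lat  + 0.000278
        AND two.a_long BETWEEN one.a_long - 0.000278 AND one.a_long + 0.000278
        AND two.b_lat  BETWEEN one.b_lat  - 0.000278 AND one.b_lat  + 0.000278
        AND two.b_long BETWEEN one.b_long - 0.000278 AND one.b_long + 0.000278
        AND two.freq   BETWEEN one.freq   - 1.0      AND one.freq   + 1.0
  )
ORDER BY one.line_id;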
To be honest, PostgreSQL isn't the best solution for this. Sybase IQ (the best fit) or Oracle Exadata (a weaker option) can do it much better thanks to their columnar data structures and Bloom filtering.

Related

Convert T-SQL Cross Apply to Redshift

I am converting the following T-SQL statement to Redshift. The purpose of the query is to convert a column in the table with a value containing a comma delimited string with up to 60 values into multiple rows with 1 value per row.
SELECT
id_1
, id_2
, value
into dbo.myResultsTable
FROM myTable
CROSS APPLY STRING_SPLIT([comma_delimited_string], ',')
WHERE [comma_delimited_string] is not null;
In SQL Server this processes 10 million records in just under 1 hour, which is fine for my purposes. Obviously a direct conversion to Redshift isn't possible because Redshift has neither CROSS APPLY nor STRING_SPLIT, so I built a solution using the process detailed here (Redshift. Convert comma delimited values into rows), which uses split_part() to split the comma-delimited string into multiple columns, followed by another query that unions everything to get the final output into a single column. But a typical run takes over 6 hours to process the same amount of data.
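What I built is roughly equivalent to the following (an abridged sketch; the real version repeats the branch up to position 60 and then filters out the empty strings):
SELECT id_1, id_2, SPLIT_PART(comma_delimited_string, ',', 1) AS value
FROM myTable
WHERE comma_delimited_string IS NOT NULL
UNION ALL
SELECT id_1, id_2, SPLIT_PART(comma_delimited_string, ',', 2)
FROM myTable
WHERE comma_delimited_string IS NOT NULL
UNION ALL
SELECT id_1, id_2, SPLIT_PART(comma_delimited_string, ',', 3)
FROM myTable
WHERE comma_delimited_string IS NOT NULL;
-- ...and so on up to position 60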
Knowing the power difference between the machines, I wasn't expecting to run into this issue. The SQL Server I was using for the comparison test was a simple server with 12 processors and 32 GB of RAM, while the Redshift cluster is based on dc1.8xlarge nodes (I don't know the total count). The instance is shared with other teams, but when I look at the performance information there are plenty of available resources.
I'm relatively new to Redshift, so I'm still assuming I'm not understanding something, but I have no idea what I'm missing. Are there things I need to check to make sure the data is loaded in an optimal way (I'm not an admin, so my ability to check this is limited)? Are there other Redshift query options that are better than the example I found? I've searched for other methods and optimizations, but the alternatives I've seen involve cross joins, something I'd like to avoid (plus, when I tried to talk to the DBAs running the Redshift cluster about this option their response was a flat "No, can't do that"). I'm not even sure where to go at this point, so any help would be much appreciated!
Thanks!
I've found a solution that works for me.
You need to do a JOIN on a numbers table, for which you can use any table as long as it has more rows than the number of fields you need. You also need to make sure the numbers are integers by forcing the type. Using the function regexp_count on the column to be split in the ON condition to count the number of fields (delimiters + 1) generates one output row per repetition.
Then you use the split_part function on the column, and use the numbers.num value to extract a different part of the text for each of those rows.
SELECT
    comma_delimited_string,
    numbers.num,
    REGEXP_COUNT(comma_delimited_string, ',') + 1 AS nfields,
    SPLIT_PART(comma_delimited_string, ',', numbers.num) AS field
FROM mytable
JOIN (
    SELECT (ROW_NUMBER() OVER (ORDER BY 1))::int AS num
    FROM mytable
    LIMIT 15 -- max number of fields
) AS numbers
    ON numbers.num <= REGEXP_COUNT(comma_delimited_string, ',') + 1

Generate non-fragmenting UUIDs in Postgres?

If I understand correctly, fully-random UUID values create fragmented indexes. Or, more precisely, the lack of a common prefix prevents dense trie storage in the indexes.
I've seen a suggestion to use uuid_generate_v1() or uuid_generate_v1mc() instead of uuid_generate_v4() to avoid this problem.
However, it seems that Version 1 of the UUID spec puts the low bits of the timestamp first, preventing a shared prefix. Also, that timestamp is 60 bits, which seems like it may be overkill.
By contrast, some databases provide non-standard UUID generators with a timestamp in the leading 32 bits followed by 12 bytes of randomness. See Datomic's Squuids for example 1, 2.
Does it in fact make sense to use "Squuids" like this in Postgres? If so, how can I generate such IDs efficiently with PL/pgSQL?
Note that inserting sequential index entries will result in a denser index only if you don't delete values and all your updates produce heap only tuples.
If you want sequential unique index values, why not build them yourself?
You could use clock_timestamp() in microseconds as bigint and append values from a cycling sequence:
CREATE SEQUENCE seq MINVALUE 0 MAXVALUE 999 CYCLE;
SELECT CAST(floor(EXTRACT(epoch FROM t)) AS bigint) % 1000000 * 1000000000
     + CAST(to_char(t, 'US') AS bigint) * 1000
     + nextval('seq')
FROM (SELECT clock_timestamp()) clock(t);
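If you want this callable from PL/pgSQL or usable as a column default, one option (a sketch; the function name here is made up) is to wrap the expression in a SQL function:
CREATE OR REPLACE FUNCTION next_time_seq_id() RETURNS bigint AS
$$
SELECT CAST(floor(EXTRACT(epoch FROM t)) AS bigint) % 1000000 * 1000000000
     + CAST(to_char(t, 'US') AS bigint) * 1000
     + nextval('seq')
FROM (SELECT clock_timestamp()) clock(t);
$$ LANGUAGE sql VOLATILE;
-- e.g.: ALTER TABLE some_table ALTER COLUMN id SET DEFAULT next_time_seq_id();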

Combining traditional and spatial indices in Postgres

I have timestamped location data.
I want Postgres to efficiently execute queries that are bounded in time and space. e.g.
select *
from tracking_tags
where ts >= '1990-01-01T00:00:00.000Z'
and ts < '2000-01-01T00:00:00.000Z'
and lat > 40.0
and lat < 50.0
and long < 0.0
and long > -10.0
How should I approach this from an indexing point of view?
I am confused because I think I might need to choose between a normal b-tree index on ts and a GIST index on lat/long POINTs, but I need a composite index (or possibly two).
Assume a decade of data, with a thousand records per day.
(P.S. Apologies for nonsense SQL, I haven't yet switched from MySQL to Postgres - but this is a Postgres question.)
Indexes for this particular table schema could vary greatly depending on what information you need to fetch.
For example, the query below would likely use the index effectively
CREATE INDEX ON tracking_tags USING gist (point(lat, long), ts);
SELECT *
FROM tracking_tags
WHERE point(lat, long) <@ box(point(40, -10), point(50, 0)) AND
      ts <@ tstzrange '[1990-01-01,2000-01-01)' AND
      lat NOT IN (40, 50) AND long NOT IN (-10, 0); -- drop the boundary values, since <@ includes the box edges
The btree_gist extension allows you to make a gist index on timestamps which makes it possible to combine them with PostGIS indexes. PostgreSQL also can use multiple indexes in one query. You'll have to test and see which combination performs the best.
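For example, with btree_gist (and PostGIS, if you go that route) installed, a single combined index could be sketched like this; the geometry expression and SRID are assumptions, and your queries would have to filter on that same expression for the index to be considered:
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE EXTENSION IF NOT EXISTS postgis;
-- one GiST index covering both the spatial and the temporal dimension
CREATE INDEX tracking_tags_geom_ts_idx
    ON tracking_tags
    USING gist (ST_SetSRID(ST_MakePoint(long, lat), 4326), ts);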

Cassandra efficient table walk

I'm currently working on a benchmark (which is part of my bachelor thesis) that compares SQL and NoSQL databases based on an abstract data model and abstract queries, to achieve a fair implementation on all systems.
I'm currently working on the implementation of a query that is specified as follows:
I have a table in Cassandra that is specified as follows:
CREATE TABLE allocated(
partition_key int,
financial_institution varchar,
primary_uuid uuid,
report_name varchar,
view_name varchar,
row_name varchar,
col_name varchar,
amount float,
PRIMARY KEY (partition_key, report_name, primary_uuid));
This table contains about 100,000,000 records (~300GB).
We now need to calculate the sum for the field "amount" for every possible combination of report_name, view_name, col_name and row_name.
In SQL this would be quite easy, just select sum (amount) and group it by the fields you want.
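For reference, that aggregation in SQL would be something like:
SELECT report_name, view_name, row_name, col_name, SUM(amount) AS total_amount
FROM allocated
GROUP BY report_name, view_name, row_name, col_name;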
However, since Cassandra does not support these operations (which is perfectly fine), I need to achieve this another way.
Currently I achieve this by doing a full-table walk, processing each record and storing the sum in a HashMap in Java for each combination.
The prepared statement I use is as follows:
SELECT
partition_key,
financial_institution,
report_name,
view_name,
col_name,
row_name,
amount
FROM allocated;
That works partially on machines with lots of RAM for both Cassandra and the Java app, but crashes on smaller machines.
Now I'm wondering whether it's possible to achieve this in a faster way?
I could imagine using the partition_key, which also serves as the Cassandra partition key, and doing this for every partition (I have 5 of them).
I also thought of doing this multithreaded, by assigning every partition and report to a separate thread and running them in parallel, but I guess this would cause a lot of overhead on the application side.
Now to the actual question: Would you recommend another execution strategy to achieve this?
Maybe I still think too much in a SQL-like way.
Thank you for your support.
Here are two ideas that may help you.
1) You can efficiently scan rows in any table using the following approach. Consider a table with PRIMARY KEY (pk, sk, tk). Let's use a fetch size of 1000, but you can try other values.
First query (Q1):
select whatever_columns from allocated limit 1000;
Process these and then record the value of the three columns that form the primary key. Let's say these values are pk_val, sk_val, and tk_val. Here is your next query (Q2):
select whatever_columns from allocated where token(pk) = token(pk_val) and sk = sk_val and tk > tk_val limit 1000;
The above query will look for records with the same pk and sk, but for the next values of tk. Keep repeating as long as you keep getting 1000 records. When you get anything less, ignore tk and switch to a greater-than condition on sk. Here is that query (Q3):
select whatever_columns from allocated where token(pk) = token(pk_val) and sk > sk_val limit 1000;
Again, keep doing this as long as you get 1000 rows. Once you are done, you run the following query (Q4):
select whatever_columns from allocated where token(pk) > token(pk_val) limit 1000;
Now, you again use the pk_val, sk_val, tk_val from the last record, and run Q2 with these values, then Q3, then Q4.....
You are done when Q4 returns less than 1000.
2) I am assuming that 'report_name, view_name, col_name and row_name' are not unique and that's why you maintain a hashmap to keep track of the total amount whenever you see the same combination again. Here is something that may work better. Create a table in Cassandra where the key is a combination of these four values (maybe delimited). If there were three, you could have simply used a composite key for those three. You also need a column called amounts, which is a list. As you are scanning the allocated table (using the approach above), for each row you do the following:
update amounts_table set amounts = amounts + [whatever_amount] where my_primary_key = four_col_values_delimited;
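A minimal sketch of that staging table (the table and column names just follow the statement above; the value type is an assumption):
CREATE TABLE amounts_table (
    my_primary_key text PRIMARY KEY,  -- e.g. report_name|view_name|row_name|col_name, delimited
    amounts list<float>
);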
Once you are done, you can scan this table and compute the sum of the list for each row you see and dump it wherever you want. Note that since there is only one key, you can scan using only token(primary_key) > token(last_value_of_primary_key).
Sorry if my description is confusing. Please let me know if this helps.

Are there performance gains to using WHERE ST_IsValid?

I'm using PostgreSQL 9.2 with PostGIS 2.0.1 on Windows.
Consider a table some_table with a GEOMETRY column named geom.
Query 1:
UPDATE some_table
SET geom = ST_MakeValid(geom)
Query 2:
UPDATE some_table
SET geom = ST_MakeValid(geom)
WHERE NOT ST_IsValid(geom)
Does calling ST_IsValid as a filter (as in Query 2) offer any performance gains (over Query 1)?
Expanding on Craig's comment, the answer is "maybe." There are a lot of possible answers here and it depends on a lot of things.
For example, suppose 80% of your table is valid and you only need to fix the other 20%. And now suppose that ST_IsValid takes 60% of the CPU time that ST_MakeValid does. You would run ST_IsValid on all of your table (0.6 * 1) plus you would run ST_MakeValid on the invalid 20% (1 * 0.2). This would save you about 20% of your time without an index. If you had a functional index it might save you a bunch of time (the numbers are hypothetical of course).
Suppose on the other hand half your table was invalid. You'd run the cheaper function on all rows (0.6 * 1) and the more expensive function on the other half (1 * 0.5), leading to a net slowdown of about 10% in your query. This also means that if virtually all your rows are invalid, there can be no performance benefit to checking.
So the answer is that you really need to check with EXPLAIN ANALYSE on your specific set.
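If you go the filtered route and run it regularly, one thing worth testing (a sketch, assuming ST_IsValid is allowed in an index predicate) is a partial index over the invalid rows, so Query 2 does not have to recheck the whole table each time:
CREATE INDEX some_table_invalid_geom_idx
    ON some_table USING gist (geom)
    WHERE NOT ST_IsValid(geom);
EXPLAIN ANALYSE on the UPDATE will then show whether the planner actually uses it.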