hadoop - large database query - postgresql

Situation: I have a Postgres DB that contains a table with several million rows and I'm trying to query all of those rows for a MapReduce job.
From the research I've done on DBInputFormat, Hadoop might try to reuse the same query for a new mapper, and since these queries take a considerable amount of time I'd like to prevent this in one of two ways that I've thought up:
1) Limit the job to run only one mapper that queries the whole table, and call it good.
or
2) Somehow incorporate an offset in the query so that if Hadoop does try to use
a new mapper it won't grab the same stuff.
I feel like option (1) seems more promising, but I don't know if such a configuration is possible. Option (2) sounds nice in theory, but I have no idea how I would keep track of the mappers being created, or whether it is even possible to detect that and reconfigure.
Help is appreciated. I'm mainly looking for a way to pull all of the DB table data without several copies of the same query running, because that would be a waste of time.

DBInputFormat essentially already does your option 2: it uses LIMIT and OFFSET in its queries to divide up the work. For example:
Mapper 1 executes: SELECT field1, field2 FROM mytable ORDER BY keyfield LIMIT 100
Mapper 2 executes: SELECT field1, field2 FROM mytable ORDER BY keyfield LIMIT 100 OFFSET 100
Mapper 3 executes: SELECT field1, field2 FROM mytable ORDER BY keyfield LIMIT 100 OFFSET 200
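The arithmetic behind those splits can be sketched in a few lines of Python (a simplified illustration of the idea, not Hadoop's actual DBInputFormat code):

```python
def make_splits(total_rows, num_mappers):
    """Divide total_rows into per-mapper (limit, offset) chunks,
    mimicking how DBInputFormat-style splitting carves up a query.
    Simplified sketch -- not the real Hadoop implementation."""
    chunk = total_rows // num_mappers
    splits = []
    for i in range(num_mappers):
        offset = i * chunk
        # the last mapper also picks up any remainder rows
        limit = total_rows - offset if i == num_mappers - 1 else chunk
        splits.append((limit, offset))
    return splits

for limit, offset in make_splits(300, 3):
    print(f"SELECT field1, field2 FROM mytable ORDER BY keyfield "
          f"LIMIT {limit} OFFSET {offset}")
```

With 300 rows and 3 mappers this reproduces the three queries above: LIMIT 100 with offsets 0, 100, and 200.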
So if you have proper indexes on the key field, you probably shouldn't mind that multiple queries are being run. Where you do get some possible re-work is with speculative execution: sometimes Hadoop will schedule multiple copies of the same task and simply use the output from whichever finishes first. If you wish, you can turn this off by setting the following property:
mapred.map.tasks.speculative.execution=false
However, all of this is out the window if you don't have a sensible key for which you can efficiently do these ORDER, LIMIT, OFFSET queries. That's where you might consider using your option number 1. You can definitely do that configuration. Set the property:
mapred.map.tasks=1
Technically, the InputFormat gets "final say" over how many Map tasks are run, but DBInputFormat always respects this property.
Another option to consider is Sqoop, a utility built for transferring data between relational databases and Hadoop. This would make it a two-step process, however: first copy the data from Postgres to HDFS, then run your MapReduce job.

Related

Good way to ensure data is ready to be queried while Glue partition is created?

We have queries that run on a schedule every few minutes and join a few different Glue tables (via Athena) before producing results. For the table in question, we have Glue Crawlers set up, with partitions based on snapshot_date and a couple of other columns.
In the query, we grab the latest snapshot_date and use only data from that snapshot_date. The data in S3 gets updated and put into the right folder a few times a day, but sometimes, if we query right as the data in S3 is being updated, we end up with empty results, apparently because the query hits the new snapshot_date partition while Glue is still setting the data up.
Is there a built-in way to ensure that our Glue partitions are ready before we start querying them? So far we have considered building artificial time "buffers" into our query around when we expect the snapshot_date partition data to be written and the Glue update to be complete, but I'm aware that's really brittle and depends on exact timing.

Postgres: Count all INSERT queries executed in the past 1 minute

I can currently count all active INSERT queries on the PostgreSQL server like this:
SELECT count(*) FROM pg_stat_activity WHERE query LIKE 'INSERT%';
But is there a way to count all INSERT queries executed on the server in a given period of time? E.g. in the past minute?
I have a bunch of tables into which I send a lot of inserts and I would like to somehow aggregate how many rows I am inserting per minute. I could code a solution for this, but it'd be so much easier if this was possible to somehow extract directly from the server.
Any stats like this over a given period would be very helpful: the average time a query takes to process, the bandwidth going through per minute, etc.
Note: I am using PostgreSQL 12
If not already done, install the pg_stat_statements extension and take snapshots of the pg_stat_statements view: the diff will give the number of queries executed between two snapshots.
Note: it doesn't save each individual query; rather, it parameterizes them and saves the aggregated result.
See https://www.citusdata.com/blog/2019/02/08/the-most-useful-postgres-extension-pg-stat-statements/
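As a sketch of the snapshot-diff approach, suppose each snapshot is the result of SELECT query, calls FROM pg_stat_statements loaded into a dict; the helper name and sample data below are made up for illustration:

```python
def insert_calls_between(snap_before, snap_after):
    """Given two snapshots of pg_stat_statements as {query: calls}
    dicts, return how many INSERT executions happened between them.
    Queries new in the second snapshot count from zero."""
    total = 0
    for query, calls in snap_after.items():
        if query.lstrip().upper().startswith("INSERT"):
            total += calls - snap_before.get(query, 0)
    return total

snap_t0 = {"INSERT INTO t1 VALUES ($1, $2)": 10, "SELECT * FROM t1": 5}
snap_t1 = {"INSERT INTO t1 VALUES ($1, $2)": 42, "SELECT * FROM t1": 9}
print(insert_calls_between(snap_t0, snap_t1))  # → 32
```

Taking a snapshot every minute (e.g. from cron) and diffing consecutive snapshots gives inserts per minute.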
I believe you can use an audit trigger.
The audit creates a table that registers INSERT, UPDATE, and DELETE actions, and you can adapt it. Every time your database runs one of those commands, the audit table records the action, the table, and the time of the action. It is then easy to do a COUNT() on the desired table with a WHERE clause limiting it to the last minute.
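Here is a runnable sketch of the audit-trigger idea, using SQLite so it works standalone (in Postgres you would write a trigger function in PL/pgSQL instead; all table and column names are made up):

```python
import sqlite3
import time

# In-memory database with a data table, an audit table, and a trigger
# that logs every INSERT with a timestamp.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id INTEGER, val TEXT);
CREATE TABLE audit_log (action TEXT, table_name TEXT, at_time REAL);
CREATE TRIGGER mytable_ins AFTER INSERT ON mytable
BEGIN
    INSERT INTO audit_log VALUES ('INSERT', 'mytable', strftime('%s','now'));
END;
""")

for i in range(5):
    conn.execute("INSERT INTO mytable VALUES (?, ?)", (i, "x"))

# Count the INSERTs recorded in the past minute.
one_min_ago = time.time() - 60
n = conn.execute(
    "SELECT count(*) FROM audit_log WHERE action = 'INSERT' AND at_time >= ?",
    (one_min_ago,),
).fetchone()[0]
print(n)  # → 5
```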
I couldn't find anything solid, so I created a table where I log the number of insert transactions, using a script that runs as a cron job. It was simple enough to implement, and instead of estimates I get the real values: I actually count all new rows inserted into the tables in a given interval.

Postgres table partitioning based on table name

I have a table that stores weather information for specific events at specific timestamps. I insert, update, and select on this table (much more often than I delete). All of my queries filter on timestamp and event_id. Since this table is blowing up, I was considering table partitioning in Postgres.
I could also imagine having multiple tables named "table_<event_id>_<timestamp>" to store the data for each specific timestamp, instead of using Postgres declarative/inheritance partitioning. But I noticed that nobody on the internet has written about an approach like this. Is there something I am missing?
I see that with Postgres partitioning, the data is kept both in the master and in the child tables. Why keep it in both places? That seems less efficient for inserts and updates.
Is there a general limit on the number of tables beyond which Postgres will start to choke?
Thank you!
re 1) Don't do it. Why re-invent the wheel when the Postgres developers have already done it for you by providing declarative partitioning?
re 2) You are mistaken. The data is only kept in the partition it belongs to; it just looks as if it were stored in the "master".
re 3) There is no built-in limit, but anything beyond a "few thousand" partitions is probably too much. It will still work, but query planning in particular will be slower, and sometimes query execution may suffer as well because runtime partition pruning is no longer as efficient.
Given your description, you probably want to do hash partitioning on the event ID and then create range sub-partitions on the timestamp value (so each partition for an event is again partitioned on the range of the timestamps).
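A sketch of that hash-plus-range layout in PostgreSQL DDL (assumes PostgreSQL 11 or later for hash partitioning; all table, column, and partition names are made up, and you would create one hash partition per remainder and one range partition per time window):

```sql
CREATE TABLE weather (
    event_id bigint      NOT NULL,
    ts       timestamptz NOT NULL,
    payload  jsonb
) PARTITION BY HASH (event_id);

-- one of (say) four hash partitions, itself range-partitioned on ts
CREATE TABLE weather_h0 PARTITION OF weather
    FOR VALUES WITH (MODULUS 4, REMAINDER 0)
    PARTITION BY RANGE (ts);

CREATE TABLE weather_h0_2024_01 PARTITION OF weather_h0
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
```

Queries filtering on both event_id and ts can then prune down to a single leaf partition.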

Get Unique rows in SELECT query using JPA in POSTGRESQL

I am working on a Spring Batch application in Spring Boot that will run in two different instances, and I have a scenario in which I have to retrieve unique rows from a table. By unique I mean one row per instance. For example,
id language
1 java
2 python
if I have these two rows and I run a SELECT query with LIMIT 1, the first instance should get id 1 and the second instance should get id 2. So far I have tried the JPA lock @Lock(value = LockModeType.PESSIMISTIC_WRITE), but this doesn't work; each time I get the same row. I have also tried JdbcTemplate with SELECT * FROM some_table LIMIT 1 FOR UPDATE SKIP LOCKED, which is also not working. My Postgres version is 10.3. Is there a way to achieve this?
The number of instances of my application might grow in the future, so I want to handle that as well.
Thanks in advance.
You want each instance to process a different partition of your table. In this case, I would recommend using a partitioned step.
For example, you can partition the table by even/odd IDs and make each instance process one partition. This is, in my opinion, better than locking the table and using LIMIT 1 to force each instance to read one row (which doesn't work, as you mentioned, and even if it did, it would perform very poorly).
You can find a sample job of how to partition a table here: https://github.com/spring-projects/spring-batch/blob/master/spring-batch-samples/src/main/resources/jobs/partitionJdbcJob.xml along with the corresponding partitioner here: https://github.com/spring-projects/spring-batch/blob/master/spring-batch-samples/src/main/java/org/springframework/batch/sample/common/ColumnRangePartitioner.java
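Stripped of the Spring Batch machinery, the even/odd partitioning is just modulo arithmetic on the ID. A small Python sketch (the function name and sample rows are made up; the linked ColumnRangePartitioner splits on contiguous id ranges instead):

```python
def assign_partition(row_id, num_instances):
    """Return the index of the instance that should process this row:
    partitioning by id modulo instance count (the even/odd idea,
    generalised to any number of instances)."""
    return row_id % num_instances

rows = [{"id": 1, "language": "java"}, {"id": 2, "language": "python"}]
for instance in range(2):
    mine = [r for r in rows if assign_partition(r["id"], 2) == instance]
    print(instance, [r["language"] for r in mine])
```

Each instance filters for its own partition (e.g. WHERE id % 2 = instance_index), so no two instances ever fight over the same row, and adding instances later only changes the modulus.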

Which will perform better in DB2

I need to populate a table from a master table having 2 billion records. The insert needs to satisfy some conditions, and some columns have to be calculated before the rows are inserted.
I have two options, but I don't know which to follow for better performance.
Option 1: create a cursor filtering the master table with the conditions, fetch the records one by one for calculation, and then insert them into the child table.
Option 2: insert first with the conditions applied, then do the calculation with an UPDATE statement.
Please assist.
Using a cursor to fetch data, perform the calculation, and then insert into the database will be time-consuming; my guess is that this is because each retrieval and insertion involves connection overhead and I/O (on both databases).
Databases are usually better at bulk operations, so Option 2 will definitely give you better performance. Option 2 is also better for troubleshooting, since the process is cleanly separated (step 1: insert, step 2: calculate), whereas with Option 1 an error in the middle of the process forces you to redo all the steps.
Opening a cursor and inserting records one by one may cause serious performance issues at volumes on the order of a billion, especially if there is a weak network between your database tier and application tier. The fastest way could be to use the Db2 export utility to download the data, let a program manipulate the data in the file, and then load the file back into the child table. Apart from the file-based option, you can also consider the following approaches:
1) Write an SQL stored procedure (no need to ship the data out of the database to make changes).
2) If you are using Java/JDBC, use the batch update feature to update multiple records at the same time.
3) If you are using a tool like Informatica, turn on its bulk-load feature.
Also see the IBM developerWorks article on improving insert performance. It is a little older, but the concepts are still valid: http://www.ibm.com/developerworks/data/library/tips/dm-0403wilkins/
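To illustrate the bulk-versus-row-by-row point, here is a small Python sketch using SQLite stand-ins (table names and the toy calculation are made up; the JDBC analogue of executemany() is PreparedStatement.addBatch()/executeBatch()):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE child (id INTEGER, doubled INTEGER)")

master = [(i,) for i in range(1000)]  # pretend master table rows

# Apply the filter condition and compute the derived column in one
# pass, then do a single batched insert instead of one round-trip
# per row.
rows = [(i, i * 2) for (i,) in master if i % 2 == 0]
conn.executemany("INSERT INTO child VALUES (?, ?)", rows)

n_rows = conn.execute("SELECT count(*) FROM child").fetchone()[0]
print(n_rows)  # → 500
```

The batched form moves the per-row overhead (statement dispatch, network round-trip) out of the loop, which is exactly why the set-based Option 2 tends to win at this scale.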