I use an H2 database file in my Java app. The database is relatively small, about 5 MB, on an SSD. The biggest table has about 25,000 rows (25 columns). A simple select query on this table takes around 1.3 seconds, which seems very slow. The table has a primary key on ID. Here is the test code:
long tic = System.nanoTime();
final String sqlCmd = "select * from Transactions order by ID";
//final String sqlCmd = "select ID from TRANSACTIONS order by ID";
try (Statement statement = mConnection.createStatement();
ResultSet resultSet = statement.executeQuery(sqlCmd)) {
} catch (SQLException e) {
mLogger.error("SQLException " + e.getSQLState(), e);
}
long toc = System.nanoTime();
System.err.println("transactionTimingTest: " + (toc-tic)/1e6);
Repeating the same select query runs considerably faster, about 0.2 seconds. Is there any way to improve the timing of the first run of the select query on the table?
I would expect to always receive a result set with one row from a SELECT COUNT, but results.next() always returns false. This is on HSQLDB 2.5.1.
The code below prints:
number of columns: 1. First column C1 with type INTEGER
No COUNT results
statement = connection.createStatement();
// check if table empty
statement.executeQuery("SELECT COUNT(*) FROM mytable");
ResultSet results = statement.getResultSet();
System.out.println("number of columns: " + results.getMetaData().getColumnCount() + ". First column " +results.getMetaData().getColumnName(1) + " with type " +results.getMetaData().getColumnTypeName(1) );
int numberOfRows = 0;
boolean hasResults = results.next();
if (hasResults){
numberOfRows = results.getInt(1);
System.out.println("Table size " + numberOfRows );
}else{
System.out.println("No COUNT results" );
}
statement.close();
Executing the same SQL statement in my SQL console works fine:
C1
104
Other JDBC actions on this database work fine as well.
Is there something obvious I'm missing?
The getResultSet method applies when you use execute, but not executeQuery, which itself returns a ResultSet. That returned ResultSet is the one you need to refer to; at the moment you are losing it because you never assign it to anything.
See https://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#executeQuery(java.lang.String) and https://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#getResultSet()
ResultSet results = statement.executeQuery("SELECT COUNT(*) FROM mytable");
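For clarity, a minimal corrected version of the block above (same mytable and connection) might look like this:
try (Statement statement = connection.createStatement();
     ResultSet results = statement.executeQuery("SELECT COUNT(*) FROM mytable")) {
    if (results.next()) {
        System.out.println("Table size " + results.getInt(1));
    } else {
        System.out.println("No COUNT results");
    }
}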
Using Yii2 with TimescaleDB (a PostgreSQL extension for time-series data) gets very slow on large tables (> 1 billion rows).
I think that comes from the function getTotalCount() in yii\data\BaseDataProvider, which gets called to provide pagination on every page.
We know from TimescaleDB that row counting on large tables is very slow. We can replace the function getTotalCount() in the model with a "correct" solution based on TimescaleDB's row estimate, such as:
static public function getTotalCount()
{
$connection = Yii::$app->getDb();
$command = $connection->createCommand("
SELECT h.schema_name,
h.table_name,
h.id AS table_id,
h.associated_table_prefix,
row_estimate.row_estimate
FROM _timescaledb_catalog.hypertable h
CROSS JOIN LATERAL ( SELECT sum(cl.reltuples) AS row_estimate
FROM _timescaledb_catalog.chunk c
JOIN pg_class cl ON cl.relname = c.table_name
WHERE c.hypertable_id = h.id and h.table_name = 'value'
GROUP BY h.schema_name, h.table_name) row_estimate
ORDER BY \"schema_name\", \"table_name\";
");
$result = $command->queryAll();
return floatval($result[0]["row_estimate"]);
}
But do we really have to perform this query for pagination?
This seems quite complex. How can it be optimized? Maybe just by returning a very large but fixed number, since nobody needs pagination over a billion pages?
Example:
const LARGE_NUM = 1000;
static public function getTotalCount()
{
if (Yii::$app->session->get("mytable-rows")>MyClass::LARGE_NUM) {
return MyClass::LARGE_NUM;
}
$totalRows = parent::getTotalCount();
Yii::$app->session->set("mytable-rows",$totalRows);
return $totalRows;
}
Is this a valid and fast solution? Are there dangerous side effects?
We are benchmarking some queries to see if they will still work reliably for "a lot of" data. (1 million rows isn't that much to be honest, but Postgres already fails here, so evidently it is.)
Our Java code to call these queries looks something like this:
@PersistenceContext
private EntityManager em;
@Resource
private UserTransaction utx;
for (int i = 0; i < 20; i++) {
this.utx.begin();
for (int inserts = 0; inserts < 50_000; inserts ++) {
em.createNativeQuery(SQL_INSERT).executeUpdate();
}
this.utx.commit();
for (int parameter = 0; parameter < 25; parameter++) {
long time = System.currentTimeMillis();
Assert.assertNotNull(this.em.createNativeQuery(SQL_SELECT).getResultList());
System.out.println(i + " iterations \t" + parameter + "\t" + (System.currentTimeMillis() - time) + "ms");
}
}
Or with plain JDBC:
Connection connection = //...
for (int i = 0; i < 20; i++) {
for (int inserts = 0; inserts < 50_000; inserts ++) {
try (Statement statement = connection.createStatement();) {
statement.execute(SQL_INSERT);
}
}
for (int parameter = 0; parameter < 25; parameter++) {
long time = System.currentTimeMillis();
try (Statement statement = connection.createStatement();) {
statement.execute(SQL_SELECT);
}
System.out.println(i + " iterations \t" + parameter + "\t" + (System.currentTimeMillis() - time) + "ms");
}
}
The queries we tried were a simple INSERT into a table with JSON and an INSERT across two tables with about 25 rows. The SELECT has one or two JOINs and is pretty simple. One set of queries is (I had to anonymize the SQL, otherwise I wouldn't have been allowed to post it):
CREATE TABLE ts1.p (
id integer NOT NULL,
CONSTRAINT p_pkey PRIMARY KEY ("id")
);
CREATE TABLE ts1.m(
pId integer NOT NULL,
mId character varying(100) NOT NULL,
a1 character varying(50),
a2 character varying(50),
CONSTRAINT m_pkey PRIMARY KEY (pId, mId)
);
CREATE SEQUENCE ts1.seq_p;
/*
* SQL_INSERT
*/
WITH p AS (
INSERT INTO ts1.p (id)
VALUES (nextval('ts1.seq_p'))
RETURNING id AS pId
)
INSERT INTO ts1.m(pId, mId, a1, a2)
VALUES ((SELECT pId from p), 'M1', '11', '12'),
((SELECT pId from p), 'M2', '13', '14'),
/* ... about 20 to 25 rows of values */
/*
* SQL_SELECT
*/
WITH userInput (mId, a1, a2) AS (
VALUES
('M1', '11', '11'),
('M2', '12', '15'),
/* ... about "parameter" rows of values */
)
SELECT m.pId, COUNT(m.a1) AS matches
FROM userInput u
LEFT JOIN ts1.m m ON (m.mId) = (u.mId)
WHERE (m.a1 IS NOT DISTINCT FROM u.a1) AND
(m.a2 IS NOT DISTINCT FROM u.a2) OR
(m.a1 IS NULL AND m.a2 IS NULL)
GROUP BY m.pId
/* plus HAVING, additional WHERE clauses etc. according to the use case, but that just speeds up the query */
When executing, we get the following output (the values are supposed to rise steadily and linearly):
271ms
414ms
602ms
820ms
995ms
1192ms
1396ms
1594ms
1808ms
1959ms
110ms
33ms
14ms
10ms
11ms
10ms
21ms
8ms
13ms
10ms
As you can see, after some value (usually at around 300,000 to 500,000 inserts) the time needed for the query drops significantly. Sadly we can't really debug what the result is at that point (other than that it's not null), but we assume it's an empty list, because the database tables are empty.
Let me repeat that: After half a million INSERTS, Postgres clears tables.
Of course that's not acceptable at all.
We tried different queries, all of easy to medium difficulty, and all produced this behavior, so we assume it's not the queries.
We thought that maybe the sequence returned a value too high for an integer column, so we dropped and recreated the sequence.
Once there was this exception:
org.postgresql.util.PSQLException : FEHLER: Verklemmung (Deadlock) entdeckt
Detail: Prozess 1620 wartet auf AccessExclusiveLock-Sperre auf Relation 2001098 der Datenbank 1937678; blockiert von Prozess 2480.
Which I couldn't translate myself; it roughly means:
org.postgresql.util.PSQLException : ERROR: deadlock detected
Detail: Process 1620 is waiting for an AccessExclusiveLock on relation 2001098 of database 1937678; blocked by process 2480.
But I don't think this error has anything to do with the clearing of the table. We just tested against the wrong database, so multiple queries were run on the same table. Normally we have one database per benchmark test.
Of course it's important that we find out what the error is, so that we can decide if there is any risk to our customers losing their data (because again, on error the database empties some table of its choice).
Postgres version: PostgreSQL 10.6, compiled by Visual C++ build 1800, 64-bit
We tried PostgreSQL 9.6.11, compiled by Visual C++ build 1800, 64-bit, too. And we never had the same problem there (even though that could just be luck, since it's not 100% reproducible).
Do you have any idea what the error is? Or how we could debug it? The entire benchmark test runs for an hour, so there is no immediate feedback.
I am currently working on weather monitoring.
For example a record of temperature has a date and a location (coordinates).
All of the coordinates are already in the database, what I need to add is time and the value of the temperature. Values and metadata are in a CSV file.
Basically what I'm doing is:
Get time through the file's name
Insert time into DB, and keep the primary key
Reading file, get the value and coordinates
Select query to get the id of the coordinates
Insert weather value with foreign keys (time and coordinates)
The issue is that the
"SELECT id FROM location WHERE latitude = ... AND longitude = ..."
is too slow. I have 230k files, and currently one file takes more than 2 minutes to process... Edit: after changing the index, it now takes 25 seconds, which is still too slow. Moreover, the PreparedStatement is also still slower than a plain Statement, and I cannot figure out why.
private static void putFileIntoDB(String variableName, ArrayList<String[]> matrix, File file, PreparedStatement prepWeather, PreparedStatement prepLoc, PreparedStatement prepTime, Connection conn){
try {
int col = matrix.size();
int row = matrix.get(0).length;
String ts = getTimestamp(file);
Time time = getTime(ts);
// INSERT INTO takes 14ms
prepTime.setInt(1, time.year);
prepTime.setInt(2, time.month);
prepTime.setInt(3, time.day);
prepTime.setInt(4, time.hour);
ResultSet rs = prepTime.executeQuery();
rs.next();
int id_time = rs.getInt(1);
//for each column (longitude)
for(int i = 1 ; i < col ; ++i){
// for each row (latitude)
for(int j = 1 ; j < row ; ++j){
try {
String lon = matrix.get(i)[0];
String lat = matrix.get(0)[j];
String var = matrix.get(i)[j];
lat = lat.substring(1, lat.length()-1);
lon = lon.substring(1, lon.length()-1);
double latitude = Double.parseDouble(lat);
double longitude = Double.parseDouble(lon);
double value = Double.parseDouble(var);
// With this prepared statement, instruction needs 16ms to be executed
prepLoc.setDouble(1, latitude);
prepLoc.setDouble(2, longitude);
ResultSet rsLoc = prepLoc.executeQuery();
rsLoc.next();
int id_loc = rsLoc.getInt(1);
// Whereas this block, using a plain Statement for the same query, takes 1ms
Statement stm = conn.createStatement();
ResultSet rsLoc2 = stm.executeQuery("SELECT id from location WHERE latitude = " + latitude + " AND longitude = " + longitude + ";");
rsLoc2.next();
int id_loc2 = rsLoc2.getInt(1);
// INSERT INTO takes 1ms
prepWeather.setObject(1, id_time);
prepWeather.setObject(2, id_loc);
prepWeather.setObject(3, value);
prepWeather.execute();
} catch (SQLException ex) {
Logger.getLogger(ECMWFHelper.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
} catch (SQLException ex) {
Logger.getLogger(ECMWFHelper.class.getName()).log(Level.SEVERE, null, ex);
}
}
What I already did:
Set two B-tree indexes on the location table, on the latitude and longitude columns
Dropped the foreign key constraints
The PreparedStatements passed in as parameters are:
// Prepare selection for weather_radar foreign key
PreparedStatement prepLoc = conn.prepareStatement("SELECT id from location WHERE latitude = ? AND longitude = ?;");
PreparedStatement prepTime = conn.prepareStatement("INSERT INTO time(dataSetID, year, month, day, hour) " +
"VALUES(" + dataSetID +", ?, ? , ?, ?)" +
" RETURNING id;");
// PrepareStatement for weather_radar table
PreparedStatement prepWeather = conn.prepareStatement("INSERT INTO weather_radar(dataSetID, id_1, id_2, " + variableName + ")"
+ "VALUES(" + dataSetID + ", ?, ?, ?)");
Any idea how to make things go quicker?
Ubuntu 16.04 LTS 64-bits
15.5 Gio
Intel® Core™ i7-6500U CPU @ 2.50GHz × 4
PostgreSQL 9.5.11 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit
Netbeans IDE 8.2
JDK 1.8
postgresql-42.2.0.jar
The key issue you have here is that you are missing ResultSet.close() and Statement.close() kinds of calls.
As you resolve that (add the relevant close calls), you may find that a SINGLE conn.prepareStatement call (before both for loops) improves performance even further (you will then no longer need to close the statement inside the loop, but you still need to close the result sets inside the loop).
Then you might apply batch SQL.
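For illustration, here is a minimal sketch of the inner loop with the result set closed via try-with-resources and the weather INSERTs batched (it reuses the prepLoc and prepWeather statements from the question; flushing once per file is an assumption):
// Look up the location id, closing the ResultSet as soon as it has been read
prepLoc.setDouble(1, latitude);
prepLoc.setDouble(2, longitude);
int id_loc;
try (ResultSet rsLoc = prepLoc.executeQuery()) {
    rsLoc.next();
    id_loc = rsLoc.getInt(1);
}
// Queue the weather row instead of executing each INSERT individually
prepWeather.setObject(1, id_time);
prepWeather.setObject(2, id_loc);
prepWeather.setObject(3, value);
prepWeather.addBatch();
// ... then, once per file (or every few thousand rows):
prepWeather.executeBatch();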
Using EXPLAIN, you can figure out the point where the query becomes slow.
One of the situations where I have encountered a case like this:
Compound queries, e.g. parameterized over similar date ranges from different tables that are then joined on some indexed value. Even if the date above serves as an index, the query produced by the PreparedStatement may not hit the indexes and end up doing a scan over the joined data.
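As a rough sketch (the latitude/longitude literals are just placeholders), the plan for the slow lookup can be inspected straight from JDBC:
try (Statement st = conn.createStatement();
     ResultSet plan = st.executeQuery(
         "EXPLAIN ANALYZE SELECT id FROM location WHERE latitude = 48.85 AND longitude = 2.35")) {
    while (plan.next()) {
        System.out.println(plan.getString(1)); // each row is one line of the query plan
    }
}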
I am pretty new to Java JDBC. I am trying to create a JDBC PreparedStatement to do a SELECT between two Oracle DATE values.
I know that data exists between these two times, as I can do the query directly.
When I execute the prepared statement from within my JDBC code, however, it returns 0 rows.
My input start and end times are Long Unix time values in milliseconds.
I have tried to pare down the code to the bare minimum:
public static List<Oam1731Sam5Data> getData(Long startTime, Long endTime) {
try {
String query = "SELECT timecaptured from oam1731data5 " +
"WHERE timecaptured between ? and ?";
PreparedStatement pstmt = conn.prepareStatement(query); // create a statement
Date javaStartDate = new Date(startTime);
Date javaEndDate = new Date(endTime);
pstmt.setDate(1, javaStartDate);
pstmt.setDate(2, javaEndDate);
ResultSet rs = pstmt.executeQuery();
while (rs.next()) {
String serviceId = rs.getString("SERVICEID");
String recordSource = rs.getString("RECORDSOURCE");
Date timeCaptured = rs.getDate("TIMECAPTURED");
Long oneWayMaxDelay = rs.getLong("ONEWAYMAXDELAY");
Long twoWayMaxDelay = rs.getLong("TWOWAYMAXDELAY");
Long twoWayMaxDelayVar = rs.getLong("TWOWAYMAXDELAYVAR");
Long packetLoss = rs.getLong("PACKETLOSS");
}
} catch (SQLException se) {
System.err.println("ERROR: caught SQL exception, code: " + se.getErrorCode() +
", message: " + se.getMessage());
}
The issue is that this returns 0 rows, where the same query returns data:
select insert_date, recordsource, serviceid, timecaptured, onewaymaxdelay, twowaymaxdelay, twowaymaxdelayvar, packetloss from oam1731data5
where timecaptured between to_date('2012-01-18 07:00:00','YYYY-MM-DD HH24:MI:SS')
and to_date('2012-01-18 08:00:00','YYYY-MM-DD HH24:MI:SS')
order by insert_date
DBMS_OUTPUT:
INSERT_DATE RECORDSOURCE SERVICEID TIMECAPTURED ONEWAYMAXDELAY TWOWAYMAXDELAY TWOWAYMAXDELAYVAR PACKETLOSS
1/18/2012 10:43:36 AM EV TAMP20-MTRO-NID1SYRC01-MTRO-NID1 1/18/2012 7:25:24 AM 40822 79693 343 0
1/18/2012 10:43:36 AM EV SYRC01-MTRO-NID1TAMP20-MTRO-NID1 1/18/2012 7:25:13 AM 39642 79720 334 0
1/18/2012 10:43:36 AM EV TAMP20-MTRO-NID1SYRC01-MTRO-NID1 1/18/2012
I have seen and read many posts about problems somewhat like this, but have not been able to find the key yet!
I thought about trying to make my query use a string and simply convert my dates to strings to be able to insert them for the Oracle TO_DATE function, but it seems like I should not have to do this.
And here is the output from my println statements. Is it an issue that the dates that print do NOT show the time portion?
SQL query: SELECT timecaptured from oam1731data5 WHERE timecaptured between ? and ?
Java Oracle date: 2012-01-18 end date 2012-01-18
Thanks in advance for any help.
Mitch
A java.sql.Date represents a date without time. You must use a Timestamp if you care about the time part.
The javadoc of java.sql.Date says:
To conform with the definition of SQL DATE, the millisecond values
wrapped by a java.sql.Date instance must be 'normalized' by setting
the hours, minutes, seconds, and milliseconds to zero in the
particular time zone with which the instance is associated.
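For example, a minimal sketch of the change, keeping the same millisecond startTime and endTime inputs:
pstmt.setTimestamp(1, new java.sql.Timestamp(startTime));
pstmt.setTimestamp(2, new java.sql.Timestamp(endTime));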
Using Date for your start and end dates shouldn't cause any problem. The data in the specified interval should still be retrieved (the data you printed, which were inserted on 1/18/2012, should be retrieved in this case). Therefore I don't think that omitting the time portion is your problem.