I want to use my column data in arithmetic operations with other columns. I am fetching the data using a join; however, the joined table is the same table (a self-join). Is there a way to access this data by index? I am using PostgreSQL. Any suggestions are appreciated.
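For example (hypothetical table and column names, just to illustrate what I am after), something along these lines is what I am trying to do:

SELECT a.id,
       a.amount + b.amount AS total_amount -- arithmetic that uses a column from the joined row
FROM my_table a
JOIN my_table b ON b.parent_id = a.id;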
I've inherited the task of retrieving data from a Postgres table.
The table has ~1m rows, and there are about 145k rows that I wish to retrieve. These 145k rows have a common string in one of their columns batch_name that I can use to search for them.
The table has two columns payload & result that are of type JSON. The result column contains the data that I wish to retrieve.
When I make even the simplest queries to the table:
SELECT * FROM table_name WHERE batch_name = 'an_id' LIMIT 10
The request takes ~7-10 seconds to return data.
This is despite the fact that the batch_name column has an index on it and is of type varchar(255).
Whilst investigating this, I've discovered that the JSON objects in the result column and payload column can be absolutely gigantic objects. When prettified, they are sometimes ~27k lines long.
These gigantic JSON objects seem to be the root cause of the problem.
My questions are:
What can I do to improve the efficiency of this query? Or is the ultimate solution here to just modify the table such that we are no longer storing gigantic JSON objects?
Given that I don't need to actually query fields in these JSON objects (but I DO need to retrieve them), would simply storing them as strings improve efficiency?
Why is storing large JSON objects SO inefficient?
Thanks in advance for any help, it's much appreciated.
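For context, the kind of check I have been running to see where the time goes looks roughly like this ('an_id' is a placeholder; the EXPLAIN variant is just my attempt at diagnosing the slowness):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM table_name WHERE batch_name = 'an_id' LIMIT 10;

-- Same query, but fetching only the column I actually need
EXPLAIN (ANALYZE, BUFFERS)
SELECT result FROM table_name WHERE batch_name = 'an_id' LIMIT 10;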
I need to write CSV data into MongoDB (version 4.4.4). Right now I'm using MongoEngine as a data layer for my application.
Each CSV has at least 4 million records and 8 columns.
What is the fastest way to bulk insert (if the data doesn't exist yet) or update (if the data is already in the collection)?
Right now I'm doing the following:
for inf in list:
    daily_report = DailyReport.objects.filter(company_code=inf.company_code, date=inf.date).first()
    if daily_report is not None:
        inf.id = daily_report.id
    inf.save()
The list is a list of DailyReports built from the CSV data.
The _id is auto-generated. However, for business purposes the primary keys are the variables company_code (StringField) and date (DateTimeField).
The DailyReport class has a unique compound index made of the following fields: company_code and date.
The previous code traverses the list and, for each DailyReport, looks for an existing DailyReport in the database with the same company_code and date. If one exists, the id from the DailyReport in the database is assigned to the DailyReport built from the CSV data. With the id assigned to the object, the object is saved using the Object.save() method from MongoEngine.
The insert operation is being done for each object and is incredibly slow.
Any ideas on how to make this process faster?
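For what it's worth, the direction I have been considering is a batched upsert through the underlying pymongo collection. This is only an untested sketch: it relies on MongoEngine's _get_collection() and assumes each item in the list exposes company_code, date and to_mongo().

from pymongo import UpdateOne

operations = []
for inf in list:  # the DailyReports built from the CSV
    doc = inf.to_mongo().to_dict()
    doc.pop('_id', None)  # never $set _id; let MongoDB keep or generate it
    operations.append(UpdateOne(
        {'company_code': inf.company_code, 'date': inf.date},
        {'$set': doc},
        upsert=True,
    ))

if operations:
    # one round trip for many documents instead of one save() per document
    DailyReport._get_collection().bulk_write(operations, ordered=False)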
I am building a pipeline that connects to an on-prem Oracle DB using the Database Plugin, queries two tables (table_a, table_b), and joins those tables using Joiner Plugin, before uploading to a BigQuery table.
The problem I have now is that the Foreign Keys to join table_a and table_b have different data types when I use Get Schema in the Database Plugin. In Joiner, I am joining the tables on table_a.customer_id = table_b.customer_id.
The dtype of table_a.customer_id is LONG but table_b.customer_id is DOUBLE. In the source Oracle DB, both columns are actually integers. For some reason, though, using Get Schema thinks they are LONG and DOUBLE.
I am obviously getting an error in Joiner when trying to join on foreign keys with different data types.
Is there a way to cast/convert the columns from the tables to match so that I can use Joiner?
I've seen some examples using Wrangler Transform to parse dates, but I don't see anything to convert to any other data types. I couldn't find any directive examples either: https://github.com/data-integrations/wrangler.
You can transform your data before joining it by using any of the transform plugins that Cloud Data Fusion offers. As @muscat mentioned, you can use the Wrangler transform and utilize the Set Type directive, or you can use the Projection transform and configure the Convert field.
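For example, with Wrangler the recipe could contain a directive along these lines (syntax as in the Wrangler set-type directive; customer_id is the column from the question, and long is just one possible target type that matches table_a):

set-type :customer_id long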
I am trying to create an index on a tstzrange[] column in PostgreSQL 10. I created the column via the pgAdmin 4 GUI, set its name and data type as tstzrange[] and set it as NOT NULL, nothing more.
I then did a CREATE EXTENSION btree_gist; for the database and it worked.
Then I saw in the documentation that I should index the range and I do:
CREATE INDEX era_ac_range_idx ON era_ac USING GIST (era_ac_range);
...but then I get:
ERROR: data type tstzrange[] has no default operator class for
access method "gist"
which, frankly, I don't know what it actually means or how to solve it. What should I do?
PS: that column is currently empty; it has no data yet.
PS2: This table describes chronological eras; there is an id, the era name (e.g. the sixties) and the timestamptz range (e.g. 1960-1969).
A date is inserted by the user and I want to check in which era it belongs.
Well, you have an array of timestamp-ranges as a single column. You can index an array with a GIN index and a range with (iirc) GIN or GiST. However, I'm not sure how an index on a column that is both would operate. I guess you could model it as an N-dimensional r-tree or some such.
I'm assuming you want to check for overlapping ranges. Could you normalise the data and have a linked table with one range in each row?
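To illustrate the normalised version, here is a sketch with made-up names: one tstzrange per row instead of an array, which can use the built-in GiST operator class for ranges.

CREATE TABLE era
(
    era_id    serial PRIMARY KEY,
    era_name  text NOT NULL,
    era_range tstzrange NOT NULL
);

CREATE INDEX era_range_idx ON era USING gist (era_range);

-- Which era does a user-supplied date fall into?
SELECT era_name
FROM era
WHERE era_range @> '1965-06-01 00:00:00+00'::timestamptz;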
I am new to Postgres and am experimenting with the hstore extension. Looking for some guidance. I need to support basic reporting on timeseries data for various products that we sell. I have a large amount of data in the format "Timestamp, Value" for each product. This data is available in a CSV file for each product.
I am thinking of using hstore to store this data in the key-value format, assuming that all the timeseries data for a single product can be stored in a single hstore object. I need to be able to query this data by specific times, say, what was the value of a product at a given time? I also need to run simple queries like retrieving the times where the product cost more than $100.
I'm planning to have a table with a product id column and an hstore column. But I am not very clear on how to make this work:
The hstore column needs to be loaded from the thousands of timestamp/value records that exist in a CSV, and appended to whenever we get a new CSV.
The table needs to store the productId and corresponding Timeseries data.
Can you please advise whether using hstore would be helpful? If so, how can I load data from a CSV as explained above? Also, if there could be any impact on insert/update performance in the hstore as the data grows, please share your experiences.
I do think you should start with a simple, normalised schema first, especially since you are new to PostgreSQL. Something like:
CREATE TABLE product_data
(
    product TEXT, -- I'm making an assumption about the types of your columns
    time TIMESTAMP,
    value DOUBLE PRECISION,
    PRIMARY KEY (product, time)
);
I would definitely keep hstore and similar options in mind, if and when your data becomes large enough that efficiency is more important than simplicity. But note that all options have an efficiency tradeoff.
Do you know how much data you're going to support? Number of products, number of distinct timestamps for each product?
What other queries do you want to run? A query for the times where a single product cost more than $100 would benefit from an index on (product, value), if the product has many distinct timestamps.
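For example (using the product_data table above; 'widget-a' is a made-up product id):

CREATE INDEX product_data_product_value_idx ON product_data (product, value);

SELECT time, value
FROM product_data
WHERE product = 'widget-a'
  AND value > 100;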
Other options
hstore is most useful if you want to store a set of arbitrary key-value pairs in a row. You could use it here, with a row for each product, and each distinct timestamp for that product being a key in the product's hstore. The downsides are that keys and values in hstore are text, whereas your keys are timestamps and your values are numbers of some kind. So there will be a certain reduction in type checking, and a certain increase in type casting cost required. Another possible downside is that some queries on the hstore might not use indexes very efficiently. The above table can use simple B-tree indexes for range queries (say you want to pull out the values between two dates for a product). But hstore indexes are much more limited; you can use a GiST or GIN index on an hstore column to find all the rows that feature a certain key.
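A sketch of what that hstore layout might look like (table and product names are mine, not from the question; note the casts back from text):

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE product_series
(
    product TEXT PRIMARY KEY,
    series HSTORE -- timestamp-as-text => value-as-text
);

-- Value of a product at a given time
SELECT (series -> '2013-01-01 10:00:00')::double precision AS value
FROM product_series
WHERE product = 'widget-a';

-- A GIN index helps key-existence operators (?, ?|, ?&, @>), not value comparisons
CREATE INDEX product_series_series_idx ON product_series USING gin (series);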
Another option (which I've played with and use experimentally for some of my databases) is arrays. Basically, each product will have an array of values, and each timestamp maps to an index in the array. This is easy if the timestamps are perfectly regular. For example, if all your products had a value every hour for every day, you could use a table like this:
CREATE TABLE product_data
(
    product TEXT,
    day DATE,
    "values" DOUBLE PRECISION[], -- 24 hourly values (quoted because VALUES is a reserved word)
    PRIMARY KEY (product, day)
);
You can construct views and indexes to make querying this table moderately easy; one possible view is sketched below. (I wrote a blog post on this technique at http://ejrh.wordpress.com/2011/03/20/vector-denormalisation-in-postgresql/.)
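For instance, a view along these lines (my sketch, assuming one value per hour of the day as above; WITH ORDINALITY needs PostgreSQL 9.4+) expands the array back into one row per product, timestamp and value:

CREATE VIEW product_data_expanded AS
SELECT product,
       day + (hour - 1) * INTERVAL '1 hour' AS time,
       value
FROM product_data,
     unnest("values") WITH ORDINALITY AS u(value, hour);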
But my advice is still: start with a simple table, then explore ways to improve efficiency when you know you're going to need them.