Don't use index in pandas-profiling

When running pandas-profiling on a dataframe, I see it analyses the index as a variable. Note: my index is a unique key (named UUID).
Is there a way to exclude the index from the report?
I understand I could remove it in pandas, but in my head I would like to do something like:
ProfileReport(df, use_index=False)

I agree that having a use_index=False option in ProfileReport would be nice and clean, but it apparently doesn't exist (yet).
So currently the only way I can find to keep the index out of the report is to drop it before profiling:
df.reset_index(drop=True, inplace=True)
This gets the job done.

Related

Postgresql best index for datetime ranges

I have a Postgres table “tasks” with the fields “start”: timestamptz, “finish”: timestamptz, “type”: int (and a lot of others). It contains about 200 million records. The start, finish, and type fields each have a separate B-tree index.
I'd like to build a “Tasks for a period” report and need to get all tasks which fall (fully or partially) inside the reporting period. The report could be built for all task types or for a specific one.
So I wrote the SQL:
SELECT * FROM tasks
WHERE start<={report_to}
AND finish>={report_from}
AND ({report_tasktype} IS NULL OR type={report_tasktype})
and it runs for ages even on short reporting periods.
Please advise: is there a way to improve performance by altering the query or by creating new indexes on the table? For various reasons I can't change the structure of the “tasks” table.
You would want a GiST index on the range. Since you already have it stored as two end points rather than as a range, you could use a functional index to convert them on the fly:
CREATE INDEX ON tasks USING GIST (tstzrange(start, finish));
And then compare the ranges for overlap with &&.
It may also improve things to add "type" as a second column to the index, which requires the btree_gist extension.
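
A hedged sketch of how the pieces could fit together, assuming the column names from the question (the index name is arbitrary, and note that tstzrange defaults to '[)' bounds, so check the boundary semantics against the original start <= / finish >= comparisons):

CREATE EXTENSION IF NOT EXISTS btree_gist;  -- lets the int "type" column join a GiST index

CREATE INDEX tasks_period_type_idx
    ON tasks USING GIST (tstzrange(start, finish), type);

-- The WHERE expression must match the indexed expression exactly:
SELECT * FROM tasks
WHERE tstzrange(start, finish) && tstzrange({report_from}, {report_to}, '[]');
-- for a single task type, add: AND type = {report_tasktype}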

Sphinx / Manticore - base one plain index off another?

I have a plain text index that sucks data from MySQL and inserts it into Manticore in the format I need (e.g. converting datetime strings to timestamps, CONCATing some fields, etc.).
I then want to create a second plain index based off this data to group it further. This saves me from having to re-run the normalisation that's done for the first index, and makes it easier to query in the future.
For example, my first index is a list of all phone calls that have been made / received (telephone number, duration, agent). The second index should group by year-month-date in such a way that I can see how many calls each agent made on each day. This means I end up with idx_phone_calls and idx_phone_calls_by_date.
Currently, I generate the first index from MySQL, then get Manticore to query itself (by setting the MySQL host to localhost). It works, but it feels as though I should be able to query Manticore directly from within the index definition. However, I'm struggling to find out whether that's possible.
Is there a better way to do it?
Well, Sphinx/Manticore has its own GROUP BY support, so you may be able to just run the final query against the original index anyway and avoid the need for the second index.
Sphinx's aggregation is (in some ways) more powerful than MySQL's, and it can do some 'super aggregation' functions (like WITHIN GROUP ORDER BY); see the sketch below.
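For example, a hedged SphinxQL sketch against the first index (call_ts is a hypothetical timestamp attribute, and grouping by multiple attributes needs a reasonably recent Sphinx/Manticore):

SELECT agent, YEARMONTHDAY(call_ts) AS day, COUNT(*) AS calls
FROM idx_phone_calls
GROUP BY day, agent
ORDER BY day ASC;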
But otherwise there is no direct way to create one index off another (e.g. there is no CREATE TABLE idx_phone_calls_by_date SELECT ... FROM idx_phone_calls ...).
Your 'solution' of directing indexer to query the data from searchd is good. In general this should be pretty efficient; particularly on localhost, there is little overhead. It also maintains the logical separation of searchd being for queries and indexer being for, well, building indexes.

Postgres query shows under the 'Most time consuming' section of Heroku

I am using the below query for a search:
I guess the current time is not bad, but I am still looking for further optimizations. I also saw in the EXPLAIN ANALYZE report that the nested loop and nested loop joins show in red; it would be great to get an idea of how to reduce that. I was thinking of adding an index for the search key. Any further suggestions to improve this would be great. I have added the EXPLAIN ANALYZE results from 3 executions, run in production.
You could try to add ingredients.name or ingredients.code to an existing index, or to create a new index, so that more rows are filtered during the ingredients index scan.
You should also try to avoid using a function on a column, such as LOWER(ingredients.name), to make sure the right index is used; an index on the bare column cannot serve a filter on LOWER(name) unless the index is built on that same expression.
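
For instance, a hedged sketch (index names are arbitrary): either compare the raw column so a plain index applies, or build an expression index that matches the LOWER(...) call.

-- Plain index: used when the query compares ingredients.name directly.
CREATE INDEX ingredients_name_idx ON ingredients (name);

-- Expression index: used when the query filters on LOWER(name),
-- e.g. WHERE LOWER(ingredients.name) = LOWER({search_term})
CREATE INDEX ingredients_lower_name_idx ON ingredients (LOWER(name));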

Postgres - GIN index doesn't work

It seems that my server won't use a GIN index.
I've created a new database with one table.
I've inserted one row as an example.
I've loaded the trigram extension and created a GIN index using trigrams.
But when I check whether the index is being used, I can see it isn't.
Any ideas?
SQL: http://pastebin.com/1yDQQA1Z
P.S. A day ago I followed a tutorial about trigrams. Basically it was the same as my example above. The table had 2 columns, numeric(5, 0) and character varying (the one with the GIN trgm index). The query used the LIKE operator with "%", and the index was working (I could see a bitmap scan in the query plan), so I know my server can use the index (and it's properly installed).
Thanks in advance.
Don't test on one row; it is meaningless.
Here's an excerpt of the documentation explaining why, in Examining Index Usage:
Use real data for experimentation. Using test data for setting up indexes will tell you what indexes you need for the test data, but that is all.

It is especially fatal to use very small test data sets. While selecting 1000 out of 100000 rows could be a candidate for an index, selecting 1 out of 100 rows will hardly be, because the 100 rows probably fit within a single disk page, and there is no plan that can beat sequentially fetching 1 disk page.
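
A hedged sketch of testing against a more realistic volume (table and column names here are made up for illustration):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE items (id serial PRIMARY KEY, name text);

-- Load enough rows that an index scan can beat a sequential scan.
INSERT INTO items (name)
SELECT md5(random()::text) FROM generate_series(1, 100000);

CREATE INDEX items_name_trgm_idx ON items USING GIN (name gin_trgm_ops);
ANALYZE items;

-- Expect a bitmap index scan here rather than a seq scan:
EXPLAIN ANALYZE SELECT * FROM items WHERE name LIKE '%abc%';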

I have a massive table that I need to optimize. I think I need to use indexes, but I was hoping for some more information about them

So I have a large table that I query (select only) quite frequently. The table is around 12,000 rows. Since the advent of iOS, the time it takes to run these select queries has gone up 4-5x.
I was told that I need to add an index to my table. The query that I am using looks like this:
SELECT * FROM book_content WHERE book_id = ? AND chapter = ? ORDER BY verse ASC
How can I create an index for this table? Is it a command I just run once? What exactly does the index do? I didn't learn about these in school, so they still seem like some sort of magic to me; I was hoping to get a little instruction.
Thanks!
You want an index on book_id and chapter. Without an index, the server would do a table scan, essentially loading the entire table into memory to do its search.
Do a quick search for the CREATE INDEX command in the RDBMS that you are using. You create the index once, and every time you do an INSERT, DELETE, or UPDATE, the server will update the index automatically.
An index can be UNIQUE and it can span multiple fields (in your case, book_id and chapter). If you make it UNIQUE, the database will not allow you to insert a second row with the same key (in this case, the same book_id and chapter). On most servers, having one index on two fields is different from having two individual indexes on one field each.
A MySQL example would be:
CREATE INDEX id_chapter_idx ON book_content (book_id,chapter);
If you want only one record for each book_id, chapter combination, use this command:
CREATE UNIQUE INDEX id_chapter_idx ON book_content (book_id,chapter);
A PRIMARY INDEX is a special index that is UNIQUE and NOT NULL. Each table can only have one primary index. In fact, each table should have one primary index to ensure table integrity, especially during joins.
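
Since the example query also sorts by verse, it may be worth including that column too, so the index can satisfy the ORDER BY; a hedged sketch (the index name is arbitrary):

CREATE INDEX book_chapter_verse_idx ON book_content (book_id, chapter, verse);

With this, the server can walk the index in verse order for a given book_id and chapter instead of sorting the matching rows afterwards.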
You don't have to think of indexes as "magic".
An index on an SQL table is much like the index in a printed book - it lets you find what you're looking for without reading the entire book cover-to-cover.
For example, say you have a cookbook, and you're looking for recipes that involve chicken. The index in the back of the book might say something like:
chicken: 30,34,72,84
letting you know that you will find chicken recipes on those 4 pages. It's much faster to find this information in the index than by reading through the whole book, because the index is shorter, and (more importantly) it's in alphabetical order, so you can quickly find the right place in the index.
So, in general you want to create indexes on columns that you will regularly need to query (book_id and chapter, in your example).
Declaring a column as a primary key automatically generates an index on that column. In your case, since you mostly run selects, an index is ideal: indexes improve the speed of selection queries and degrade the speed of insertion. So you can create the indexes you think you need without worrying too much about performance.
Indexes are a very sensitive subject. If you consider using them, you need to be very careful how many you create. The primary key, or id, of each table should have a clustered index. All the rest depends on how you plan to use them. I'm fuzzy on the subject of indexes myself and have never actually worked with them, but from a seminar I watched just yesterday: you don't want too many indexes, because they can actually slow things down when you don't need them.
Let's say you put an index on 5 out of 8 fields on a table, and each index is designated for a particular query somewhere in your software. When one query is run, it uses that one index and doesn't need the other 4, so those are unneeded weight on that one query. If you need an index, be sure it is an index which could be useful in many places, not just one.