Copy file from CSV into PostgreSQL table - timestamp problem

I have data in csv format with rows of daily company stock quotes which look like this:
INTSW2027243,20200319,7.7700,7.7800,7.3600,7.3600,2442
INTSW2027391,20200319,7.4200,7.6000,6.8300,6.8900,15262
INTSW2027409,20200319,7.4800,7.5600,7.4200,7.5600,743
INTSW2028365,20200319,0.7100,0.7200,0.5400,0.5500,47495
Atari,20200319,351.0000,365.5000,350.0000,357.0000,9040
The second column of the file is the date: 2020-03-19 in this case.
I use the COPY FROM command to update the postgres companies table.
COPY companies (ticker, date, open, high, low, close, vol) FROM '/home/user/Downloads/company.csv' using delimiters ',' with null as '\null';
Whenever I use the COPY FROM command to copy the file into the PostgreSQL table, my date '20200319' changes to 1970-08-22 20:11:59, and I end up with a last record that looks something like this:
id | ticker | date | open | high | low | close | vol
---------+--------+---------------------+------+-------+-----+-------+------
2248402 | Atari | 1970-08-22 20:11:59 | 351 | 365.5 | 350 | 357 | 9040
If I manually update the companies table with the following command, I get proper results:
INSERT INTO companies (ticker, date, open, high, low, close, vol) VALUES ('Atari', to_timestamp('20200319', 'YYYYMMDD')::timestamp without time zone ,351.0000,365.5000,350.0000,357.0000,9040);
However the above solution doesn't work if the data is stored in a csv file.
Proper result:
id | ticker | date | open | high | low | close | vol
---------+--------+---------------------+------+-------+-----+-------+------
2250513 | Atari | 2020-03-19 00:00:00 | 351 | 365.5 | 350 | 357 | 9040
My Questions:
Is there a way to change the output date format in COPY FROM command?
What is the proper way to update large postgres tables with daily quotes from csv files in bulk by means of sql commands?
My postgres version:
psql (PostgreSQL) 11.7
Edit:
This is not a psql \copy question.

OK, I took the advice from madflow's comments (and partially from Abelisto) and adjusted my datestyle setting with SET.
Initially I tried:
SET datestyle = 'YYYYMMDD'; (and many more combinations of it)
But I was getting the following error:
invalid value for parameter "DateStyle": "YYYYMMDD"
I then moved on to trying: set datestyle to "YMD";
And got: SET
Now when I try: show datestyle;
I get:
DateStyle
-----------
ISO, YMD
(1 row)
And, when I try the following command:
COPY companies (ticker, date, open, high, low, close, vol) FROM '/home/user/Downloads/company.csv' using delimiters ',' with null as '\null';
It looks like I'm finally getting the right date format, so no need to adjust the COPY FROM command:
id | ticker | date | open | high | low | close | vol
---------+--------+---------------------+------+-------+-----+-------+-------
1379256 | Atari | 2020-03-16 00:00:00 | 294 | 337.5 | 256 | 337 | 48690
1379257 | Atari | 2020-03-17 00:00:00 | 347 | 381 | 338 | 357 | 36945
1379258 | Atari | 2020-03-18 00:00:00 | 364 | 380 | 350 | 357 | 19650
2251920 | Atari | 2020-03-19 00:00:00 | 351 | 365.5 | 350 | 357 | 9040
So, thanks guys for suggestions!
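For reference, here is the whole working sequence condensed into one snippet. This is just a recap of the steps above, not new advice; your_database is a placeholder, and the optional ALTER DATABASE line is only needed if you want the setting to persist beyond the current session:
-- session-level setting, as used above
SET datestyle TO 'ISO, YMD';

-- optional: persist the setting for every new session on this database
-- ALTER DATABASE your_database SET datestyle TO 'ISO, YMD';

-- the original COPY command then loads 20200319 as 2020-03-19 00:00:00
COPY companies (ticker, date, open, high, low, close, vol)
FROM '/home/user/Downloads/company.csv'
USING DELIMITERS ',' WITH NULL AS '\null';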

Related

Postgresql: Help calculating delta value in Postgres while using group by function

I am building a stock market database. I have one table with timestamp, symbol, price and volume. The volume is the cumulative volume traded per day, e.g.:
| timestamp | symbol | price | volume |
|----------------------------|--------|----------|--------|
| 2022-06-11 12:42:04.912+00 | SBIN | 120.0000 | 5 |
| 2022-06-11 12:42:25.806+00 | SBIN | 123.0000 | 6 |
| 2022-06-11 12:42:38.993+00 | SBIN | 123.4500 | 8 |
| 2022-06-11 12:42:42.735+00 | SBIN | 108.0000 | 12 |
| 2022-06-11 12:42:45.801+00 | SBIN | 121.0000 | 14 |
| 2022-06-11 12:43:43.186+00 | SBIN | 122.0000 | 16 |
| 2022-06-11 12:43:45.599+00 | SBIN | 125.0000 | 17 |
| 2022-06-11 12:43:51.655+00 | SBIN | 141.0000 | 20 |
| 2022-06-11 12:43:54.151+00 | SBIN | 111.0000 | 24 |
| 2022-06-11 12:44:01.908+00 | SBIN | 123.0000 | 27 |
I want a query to get OHLCV (open, high, low, close and volume) data. I am using the following to get OHLC data (but not volume), and I am getting proper OHLC. Note that I am using the TimescaleDB time_bucket function, which is similar to date_trunc:
SELECT
    time_bucket('1 minute', "timestamp") AS time,
    symbol,
    max(price) AS high,
    first(price, timestamp) AS open,
    last(price, timestamp) AS close,
    min(price) AS low
FROM candle_ticks
GROUP BY time, symbol
ORDER BY time DESC, symbol;
So for volume, I need to calculate the difference between the max/last volume in the same time bucket and the max/last volume in the previous time bucket, to get the following data:
| time | symbol | high | open | close | low | volume |
|------------------------|--------|----------|----------|----------|----------|--------|
| 2022-06-11 12:44:00+00 | SBIN | 123.0000 | 123.0000 | 123.0000 | 123.0000 | 14 |
| 2022-06-11 12:43:00+00 | SBIN | 141.0000 | 122.0000 | 111.0000 | 111.0000 | 10 |
| 2022-06-11 12:42:00+00 | SBIN | 123.4500 | 120.0000 | 121.0000 | 108.0000 | 3 |
What should the SQL be like? I tried to use lag, but lag and GROUP BY together are not playing well.
Would it work if you put your query in a CTE?
with ivals as (
    SELECT time_bucket('1 minute', "timestamp") AS time,
           symbol,
           max(price) AS high,
           first(price, timestamp) AS open,
           last(price, timestamp) AS close,
           min(price) AS low,
           max(volume) AS close_volume
    FROM candle_ticks
    GROUP BY time, symbol
)
select i.*,
       close_volume - coalesce(
           lag(close_volume)
               over (partition by symbol, time::date
                     order by time),
           0
       ) as time_volume
from ivals i;
Similar to Mike Organek's answer, you can collect the data into buckets via a CTE and then, in your main query, subtract a minute from the time column to get the time value for the previous bucket. You can use that value to LEFT JOIN the row for the previous time bucket within the same day:
WITH buckets as (
    SELECT
        time_bucket('1 minute', "timestamp") AS time,
        symbol,
        max(price) AS high,
        first(price, timestamp) AS open,
        last(price, timestamp) AS close,
        min(price) AS low,
        max(volume) AS close_volume
    FROM candle_ticks
    GROUP BY time, symbol
    ORDER BY time DESC, symbol
)
SELECT
    b.*,
    coalesce(b.close_volume - b2.close_volume, 0) AS time_volume
FROM buckets b
LEFT JOIN buckets b2
       ON (b.time::date = b2.time::date AND b.time - interval '1 minute' = b2.time)
This method will avoid the restrictions that TimescaleDB places on window functions.
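For anyone who wants to try both answers locally, a minimal test table built from the sample data in the question could look like the sketch below. It assumes a plain timestamptz column named "timestamp" and that the TimescaleDB extension providing time_bucket, first and last is installed:
CREATE TABLE candle_ticks (
    "timestamp" timestamptz,
    symbol      text,
    price       numeric,
    volume      bigint
);

INSERT INTO candle_ticks ("timestamp", symbol, price, volume) VALUES
    ('2022-06-11 12:42:04.912+00', 'SBIN', 120.00, 5),
    ('2022-06-11 12:42:25.806+00', 'SBIN', 123.00, 6),
    ('2022-06-11 12:42:38.993+00', 'SBIN', 123.45, 8),
    ('2022-06-11 12:42:42.735+00', 'SBIN', 108.00, 12),
    ('2022-06-11 12:42:45.801+00', 'SBIN', 121.00, 14),
    ('2022-06-11 12:43:43.186+00', 'SBIN', 122.00, 16),
    ('2022-06-11 12:43:45.599+00', 'SBIN', 125.00, 17),
    ('2022-06-11 12:43:51.655+00', 'SBIN', 141.00, 20),
    ('2022-06-11 12:43:54.151+00', 'SBIN', 111.00, 24),
    ('2022-06-11 12:44:01.908+00', 'SBIN', 123.00, 27);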

Increment date within while loop using postgresql on Redshift table

MY SITUATION:
I have written a piece of code that returns a dataset containing a web user's aggregated activity for the previous 90 days and, after some calculation, returns a score. Essentially, like RFV.
A (VERY) simplified version of the code can be seen below:
WITH start_data AS (
    SELECT user_id
          ,COUNT(web_visits)    AS count_web_visits
          ,COUNT(button_clicks) AS count_button_clicks
          ,COUNT(login)         AS count_log_in
          ,SUM(time_on_site)    AS total_time_on_site
          ,CURRENT_DATE         AS run_date
    FROM web.table
    WHERE TO_CHAR(visit_date, 'YYYY-MM-DD') BETWEEN DATEADD(DAY, -90, CURRENT_DATE) AND CURRENT_DATE
      AND some_flag = 1
      AND some_other_flag = 2
    GROUP BY user_id
    ORDER BY user_id DESC
)
The output might look something like the below:
| user_id | count_web_visits | count_button_clicks | count_log_in | total_time_on_site | run_date |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 256 | 932 |16 | 1200 | 23-01-20 |
| 2391823 | 710 | 1345 |308 | 6000 | 23-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 23-01-20 |
| 5561296 | 437 | 339 |172 | 3600 | 23-01-20 |
This output is then stored in its own AWS/Redshift table and will form the base table for the task.
SELECT *
into myschema.base_table
FROM start_data
DESIRED OUTPUT:
What I need to be able to do is iteratively run this code so that I append new data to myschema.base_table, every day, for the previous 90 days' aggregation.
The way I see it, I can either go forwards or backwards, it doesn't matter.
That is to say, I can either:
Starting from today, run the code, everyday, for the preceding 90 days, going BACK to the (first date in the table + 90 days)
OR
Starting from the (first date in the table + 90 days), run the code for the preceding 90 days, everyday, going FORWARD to today.
Option 2 seems the best option to me and the desired output looks like this (PARTITION FOR ILLUSTRATION ONLY):
| user_id | count_web_visits | count_button_clicks | count_log_in | total_time_on_site | run_date |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 412 | 339 |180 | 3600 | 20-01-20 |
| 2391823 | 417 | 6253 |863 | 2400 | 20-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 20-01-20 |
| 5561296 | 281 | 679 |262 | 4200 | 20-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 331 | 204 |83 | 3200 | 21-01-20 |
| 2391823 | 652 | 1222 |409 | 7200 | 21-01-20 |
| 3729128 | 71 | 248 |71 | 720 | 21-01-20 |
| 5561296 | 366 | 722 |519 | 3600 | 21-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 213 | 808 |57 | 3600 | 22-01-20 |
| 2391823 | 817 | 4265 |476 | 1200 | 22-01-20 |
| 3729128 | 33 | 128 |62 | 120 | 22-01-20 |
| 5561296 | 623 | 411 |283 | 2400 | 22-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 256 | 932 |16 | 1200 | 23-01-20 |
| 2391823 | 710 | 1345 |308 | 6000 | 23-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 23-01-20 |
| 5561296 | 437 | 339 |172 | 3600 | 23-01-20 |
WHAT I HAVE TRIED:
I have successfully created a WHILE loop to sequentially increment the date as follows:
CREATE OR REPLACE PROCEDURE retrospective_data()
LANGUAGE plpgsql
AS $$
DECLARE
    start_date DATE := '2020-11-20';
BEGIN
    WHILE CURRENT_DATE > start_date
    LOOP
        RAISE INFO 'Date: %', start_date;
        start_date = start_date + 1;
    END LOOP;
    RAISE INFO 'Loop Statement Executed Successfully';
END;
$$;
CALL retrospective_data();
Thus producing the dates as follows:
INFO: Date: 2020-11-20
INFO: Date: 2020-11-21
INFO: Date: 2020-11-22
INFO: Date: 2020-11-23
INFO: Date: 2020-11-24
INFO: Date: 2020-11-25
INFO: Date: 2020-11-26
INFO: Loop Statement Executed Successfully
Query 1 OK: CALL
WHAT I NEED HELP WITH:
I need to be able to apply the WHILE loop to the initial code such that the WHERE clause becomes:
WHERE TO_CHAR(visit_date, 'YYYY-MM-DD') BETWEEN DATEADD(DAY, -90, start_date) AND start_date
But where start_date is the result of each incremental loop. Additionally, the result of each execution needs to be appended to the previous.
Any help appreciated.
It is fairly clear that you come from a procedural programming background, and the first recommendation is to stop thinking in terms of loops. Databases are giant, powerful data-filtering machines, and thinking in terms of 'do step 1, then step 2' often leads to missing out on all this power.
You want to look into window functions which allow you to look over ranges of other rows for each row you are evaluating. This is exactly what you are trying to do.
Also you shouldn't cast a date to a string just to compare it to other dates (WHERE clause). This is just extra casting and defeats Redshift's table scan optimizations. Redshift uses block metadata that optimizes what data is needed to be read from disk but this cannot work if the column is being cast to another data type.
Now to your code (an off-the-cuff rewrite, and for just the first column). Be aware that GROUP BY clauses run BEFORE window functions, and I'm assuming that not all users have a visit every day. Since Redshift doesn't support RANGE in window functions, you will need to make sure all dates are represented for all user_ids. This is done by UNIONing with a sufficient number of rows to cover the date range. You may have a table like this or may want to create one, but I'll just generate something on the fly to show the process (this makes the assumption that there are fewer dates in the range than rows in the table - likely, but not iron clad).
SELECT user_id
      ,COUNT(web_visits) AS count_web_visits_by_day
      ,SUM(count_web_visits_by_day) OVER (partition by user_id order by visit_date rows between 90 preceding and current row)
      ...
      ,visit_date
FROM (
    SELECT visit_date, user_id, web_visits, ...
    FROM web.table
    WHERE some_flag = 1 AND some_other_flag = 2
    UNION ALL -- this is where I want to union with a full set of dates by user_id
    SELECT visit_date, user_id, NULL AS web_visits, ...
    FROM (
        (SELECT DISTINCT user_id FROM web.table) u
        CROSS JOIN
        (SELECT CURRENT_DATE + 1 - row_number() OVER (ORDER BY visit_date) AS visit_date
         FROM web.table) d
    )
) t
GROUP BY visit_date, user_id
ORDER BY visit_date ASC, user_id DESC;
The idea here is to set up your data to ensure that you have at least one row for each user_id for each date. Then the window functions can operate on the "grouped by date and user_id" information to sum and count over the past 90 rows (which, with one row per day, is the same as the past 90 days). You now have all the information you want for all dates, each looking back over 90 days. One query gives you all the information: no while loop, no stored procedures.
Untested but should give you the pattern. You may want to massage the output to give you the range you are looking for and clean up NULL result rows.
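To make the windowed idea concrete, here is a stripped-down sketch under the assumption that a hypothetical daily_activity table already holds exactly one row per user_id per day (which is what the UNION ALL trick above is meant to guarantee); the table and column names are illustrative, not from the original schema:
SELECT user_id,
       visit_date AS run_date,
       SUM(count_web_visits) OVER (
           PARTITION BY user_id
           ORDER BY visit_date
           ROWS BETWEEN 90 PRECEDING AND CURRENT ROW  -- same frame as the rewrite above
       ) AS count_web_visits_90d
FROM daily_activity
ORDER BY user_id, visit_date;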

Output of Show Meta in SphinxQL

I am trying to check whether my config has issues or I am not understanding SHOW META correctly.
If I make a regex in the config:
regexp_filter=NY=>New York
then if I do a SphinxQL search on 'NY'
Search Index where MATCH('NY')
and then Show Meta
it should show keyword1=New and keyword2=York, not NY. Is that correct?
And if it does not, then somehow my config is not working as intended?
it should show keyword1=New and keyword2=York, not NY. Is that correct?
This is correct. When you do MATCH('NY') and have NY=>New York regexp conversion then Sphinx first converts NY into New York and only after that it starts searching, i.e. it forgets about NY completely. The same happens when indexing: it first prepares tokens, then indexes them forgetting about the original text.
To demonstrate (this is in Manticore, a fork of Sphinx, but in terms of processing regexp_filter and how it affects searching it works the same way as Sphinx):
mysql> create table t(f text) regexp_filter='NY=>New York';
Query OK, 0 rows affected (0.01 sec)
mysql> insert into t values(0, 'I low New York');
Query OK, 1 row affected (0.01 sec)
mysql> select * from t where match('NY');
+---------------------+----------------+
| id | f |
+---------------------+----------------+
| 2810862456614682625 | I low New York |
+---------------------+----------------+
1 row in set (0.01 sec)
mysql> show meta;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 1 |
| total_found | 1 |
| time | 0.000 |
| keyword[0] | new |
| docs[0] | 1 |
| hits[0] | 1 |
| keyword[1] | york |
| docs[1] | 1 |
| hits[1] | 1 |
+---------------+-------+
9 rows in set (0.00 sec)
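As an extra check (my own suggestion, not part of the original answer), CALL KEYWORDS runs a query string through the index's text processing, so if the regexp_filter is being picked up it should return the tokens new and york rather than ny. Using the table t from the session above:
CALL KEYWORDS('NY', 't');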

asof (aj) join strictly less than in KDB/Q

I have a quote table and trade table, and would like to list the quotes table and join in the trades table matching on timestamps strictly less than the timestamp of the trade.
For example:
q:([]time:10:00:00 10:01:00 10:01:00 10:01:02;sym:`ibm`ibm`ibm`ibm;qty:100 200 300 400)
t:([]time:10:01:00 10:01:00 10:01:02;sym:`ibm`ibm`ibm;px:10 20 25)
aj[`time;q;t]
returns
+------------+-----+-----+----+
| time | sym | qty | px |
+------------+-----+-----+----+
| 10:00:00 | ibm | 100 | |
| 10:01:00 | ibm | 200 | 20 |
| 10:01:00 | ibm | 300 | 20 |
| 10:01:02 | ibm | 400 | 25 |
+------------+-----+-----+----+
But I'm trying to get a result like:
+------------+-----+-----+----+
| time | sym | qty | px |
+------------+-----+-----+----+
| 10:00:00 | ibm | 100 | |
| 10:01:00 | ibm | 100 | 10 |
| 10:01:00 | ibm | 100 | 20 |
| 10:01:02 | ibm | 300 | 25 |
+------------+-----+-----+----+
Is there a join function that can match based on timestamps that are strictly less than the time, instead of up to and including it?
I think if you do some variation of aj[`time;q;t] then you won't be able to modify the qty column as table t does not contain it. Instead you may need to use the more "traditional" aj[`time;t;q]:
q)#[;`time;+;00:00:01]aj[`time;#[t;`time;-;00:00:01];q]
time sym px qty
-------------------
10:01:00 ibm 10 100
10:01:00 ibm 20 100
10:01:02 ibm 25 300
This shifts the times to avoid matching where they are equal but does not contain a row for each quote you had in the beginning.
I think if you wish to join trades to quotes, rather than quotes to trades as I have done, you may need to think of some method of differentiating between two trades that occur at the same time, as in your example. One method may be to use the order in which they arrive, i.e. match the first quote to the first trade.
One "hacky" way I'm thinking of is to just shift all trades by the minimum time unit, do the aj, and then shift back.

How to Join data from a dataframe

I have one table with many types of data, and some of the data carries one piece of information that is really important for analysing the rest of the data.
This is the table that I have
name     | player_id | data_ms | coins | progress |
progress | 1223      | 10      |       | 128      |
complete | 1223      | 11      | 154   |          |
win      | 1223      | 9       | 111   |          |
progress | 1223      | 11      |       | 129      |
played   | 1111      | 19      | 141   |          |
progress | 1111      | 25      |       | 225      |
This is the table that I want
name     | player_id | data_ms | coins | progress |
progress | 1223      | 10      |       | 128      |
complete | 1223      | 11      | 154   | 128      |
win      | 1223      | 9       | 111   | 129      |
progress | 1223      | 11      |       | 129      |
played   | 1111      | 19      | 141   | 225      |
progress | 1111      | 25      |       | 225      |
I need to find the progress of the player, using the condition that it has to be the first progress emitted after the data_ms (Unix epoch timestamp) of this event.
My table has 4 billion rows of data, and it's partitioned by date.
I tried to create a UDF that reads the table and filters it, but that's not an option, since you can't serialize the Spark session into a UDF.
Any idea of how I should do this?
It seems like you want to fill gaps in the progress column. I didn't really understand the condition, but if it's based on data_ms then your Hive query should look like this:
dataFrame.createOrReplaceTempView("your_table")

val progressDf = sparkSession.sql(
  """
    SELECT name, player_id, data_ms, coins,
           COALESCE(progress,
                    LAST_VALUE(progress, TRUE) OVER (
                        PARTITION BY player_id
                        ORDER BY data_ms
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS progress
    FROM your_table
  """
)
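The window above fills each row with the most recent earlier progress. If the requirement really is "the first progress emitted after the data_ms of this event", as the question words it, the same pattern can look forward instead; an untested sketch using the same hypothetical your_table view and column names:
SELECT name, player_id, data_ms, coins,
       COALESCE(progress,
                FIRST_VALUE(progress, TRUE) OVER (
                    PARTITION BY player_id
                    ORDER BY data_ms
                    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS progress
FROM your_table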