Hive double and decimal values in incorrect format - PySpark

I am migrating SQL statements from Greenplum to Hive SQL, run through PySpark scripts. We are encountering an issue with double and decimal columns: the values change when we sum them.
The source data looks like this:
+-----------------------+
| test_amount           |
+-----------------------+
| 6.987713783438851E8   |
| 1.42712398248124E10   |
| 6.808514167337805E8   |
| 1.5215888908771296E10 |
+-----------------------+
Greenplum produces the following output when summing the amount:
2022-01-16 2 14970011203.1514433
2022-01-17 2 15896793850.6107666
while the PySpark sum output is:
+-------------+------+------------------------+
| _c0 | _c1 | _c2 |
+-------------+------+------------------------+
| 2022-01-16 | 2 | 1.4970011203156286E10 |
| 2022-01-17 | 2 | 1.5896740325505075E10 |
+-------------+------+------------------------+
Please help us.
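One common fix is to cast the double column to a decimal type before aggregating, so the sum is carried out in exact base-10 arithmetic (and prints without scientific notation). A minimal PySpark sketch, assuming placeholder table and column names (only test_amount comes from the question):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# "source_table" and "txn_date" are assumed names for illustration.
df = spark.table("source_table")

result = (
    df.groupBy("txn_date")
      .agg(
          F.count("*").alias("cnt"),
          # Summing doubles accumulates binary rounding error; casting to
          # DECIMAL first keeps exact base-10 digits, as Greenplum numeric does.
          F.sum(F.col("test_amount").cast("decimal(38,7)")).alias("total_amount"),
      )
)
result.show(truncate=False)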

Related

How to get non-aggregated measures?

I calculate my metrics with SQL and publish the resulting table to Tableau Server. Afterward, I use this data source to create charts and dashboards.
For one analysis, I have already calculated the measures per day with SQL. When I use the resulting table in Tableau, it aggregates these measures to SUM by default. However, I don't want the SUM or AVG of the averages, or the SUM of the percentiles.
What I want is the result I get when I don't select the date dimension and don't GROUP BY date in SQL, as attached below.
Here is the query:
SELECT
    -- date,
    COUNT(DISTINCT id) AS count_of_id,
    AVG(timediff_in_sec) AS avg_timediff,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY timediff_in_sec) AS percentile_25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY timediff_in_sec) AS percentile_50
FROM
(
    --subquery
) AS t1
-- GROUP BY date
Here are the first 10 rows of the resulting table:
+------------+--------------+-------------+---------------+---------------+
| date | avg_timediff | count_of_id | percentile_25 | percentile_50 |
+------------+--------------+-------------+---------------+---------------+
| 10/06/2020 | 61,65186364 | 22 | 8,5765 | 13,3015 |
| 11/06/2020 | 127,2913333 | 3 | 15,6045 | 17,494 |
| 12/06/2020 | 306,0348214 | 28 | 12,2565 | 17,629 |
| 13/06/2020 | 13,2664 | 5 | 11,944 | 13,862 |
| 14/06/2020 | 16,728 | 7 | 14,021 | 17,187 |
| 15/06/2020 | 398,6424595 | 37 | 11,893 | 19,271 |
| 16/06/2020 | 293,6925152 | 33 | 12,527 | 17,134 |
| 17/06/2020 | 155,6554286 | 21 | 13,452 | 16,715 |
| 18/06/2020 | 383,8101429 | 7 | 266,048 | 493,722 |
+------------+--------------+-------------+---------------+---------------+
How can I achieve the desired output above?
Drag them all into the dimensions list; they will then be static dimensions. For your use case you could also just drag the Date field to Rows. Aggregating one value, which is what you have for each date, returns the same value whatever the aggregation type, as the sketch below illustrates.
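A plain Python sketch of that last point, using the first avg_timediff value from the table above:
vals = [61.65186364]   # the single avg_timediff value for 10/06/2020

# With exactly one value per group, SUM, MIN, MAX and AVG all collapse
# to the value itself, so the aggregation type cannot distort it.
print(sum(vals), min(vals), max(vals), sum(vals) / len(vals))
# -> 61.65186364 61.65186364 61.65186364 61.65186364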

SQL INSERT fails with "column i.dates does not exist", when it seemingly does

I'm using PostgreSQL and pgAdmin. I'm attempting to copy data between a staging table and a production table using INSERT INTO with a SELECT FROM statement, applying a to_char conversion along the way. This may or may not be the wrong approach. The SELECT fails because apparently "column i.dates does not exist".
The question is: Why am I getting 'column i.dates does not exist'?
The schema for both tables is identical except for a date conversion.
I've tried matching the schema of the tables, with the exception of the to_char conversion. I've checked and double-checked that the column exists.
This is the code I'm trying:
INSERT INTO weathergrids (location, dates, temperature, rh, wd, ws, df, cu, cc)
SELECT
    i.location AS location,
    i.dates AS dates,
    i.temperature AS temperature,
    i.rh AS rh,
    i.winddir AS winddir,
    i.windspeed AS windspeed,
    i.droughtfactor AS droughtfactor,
    i.curing AS curing,
    i.cloudcover AS cloudcover
FROM (
    SELECT location,
           to_char(to_timestamp(dates, 'YYYY-DD-MM HH24:MI'), 'HH24:MI YYYY-MM-DD HH24:MI'),
           temperature, rh, wd, ws, df, cu, cc
    FROM wosweathergrids
) i;
The error I'm receiving is:
ERROR: column i.dates does not exist
LINE 4: i.dates as dates,
^
SQL state: 42703
Character: 151
My data schema looks like this:
+-----------------+-----+-------------+-----------------------------+-----+
| TABLE | NUM | COLNAME | DATATYPE | LEN |
+-----------------+-----+-------------+-----------------------------+-----+
| weathergrids | 1 | id | integer | 32 |
| weathergrids | 2 | location | numeric | 6 |
| weathergrids | 3 | dates | timestamp without time zone | |
| weathergrids | 4 | temperature | numeric | 3 |
| weathergrids | 5 | rh | numeric | 4 |
| weathergrids | 6 | wd | numeric | 4 |
| weathergrids | 7 | wsd | numeric | 4 |
| weathergrids | 8 | df | numeric | 4 |
| weathergrids | 9 | cu | numeric | 4 |
| weathergrids | 10 | cc | numeric | 4 |
| wosweathergrids | 1 | id | integer | 32 |
| wosweathergrids | 2 | location | numeric | 6 |
| wosweathergrids | 3 | dates | character varying | 16 |
| wosweathergrids | 4 | temperature | numeric | 3 |
| wosweathergrids | 5 | rh | numeric | 4 |
| wosweathergrids | 6 | wd | numeric | 4 |
| wosweathergrids | 7 | ws | numeric | 4 |
| wosweathergrids | 8 | df | numeric | 4 |
| wosweathergrids | 9 | cu | numeric | 4 |
| wosweathergrids | 10 | cc | numeric | 4 |
+-----------------+-----+-------------+-----------------------------+-----+
Your derived table (sub-query) named i has no column named dates because that column is "hidden" inside the to_char() call: since the expression does not define an alias, no column named dates is visible outside the derived table. Adding an alias (... AS dates) to that expression would fix the error.
But I don't see a reason for a derived table to begin with. Also: aliasing a column with its own name is unnecessary; i.location AS location is exactly the same thing as i.location.
So your query can be simplified to:
INSERT INTO weathergrids (location, dates, temperature, rh, wd, ws, df, cu, cc)
SELECT
    location,
    to_timestamp(dates, 'YYYY-DD-MM HH24:MI'),
    temperature,
    rh,
    wd,
    ws,
    df,
    cu,
    cc
FROM wosweathergrids;
You don't need to give an alias to the to_timestamp() expression, as columns are matched by position, not by name, in an INSERT ... SELECT statement. (Note that per your schema, the source columns are wd, ws, df, cu and cc, not winddir, windspeed and so on.)

Lead on DataStage for timestamp?

I need to use the Lead function in DataStage on a timestamp column: each row should get the next row's timestamp minus 10 minutes in a new column. Someone may have done this before; I have tried, but strings can't be put into a timestamp column.
How can I solve this problem?
Input
+----+---------------------+
| Id | date                |
+----+---------------------+
| 1  | 01/04/2016 13:45:25 |
| 2  | 10/04/2016 01:25:36 |
| 3  | 26/10/2017 22:35:13 |
+----+---------------------+
Output
+----+---------------------+---------------------+
| Id | date                | Before              |
+----+---------------------+---------------------+
| 1  | 01/04/2016 13:45:25 | 10/04/2016 01:15:36 |
| 2  | 10/04/2016 01:25:36 | 26/10/2017 22:35:13 |
| 3  | 26/10/2017 22:35:13 | null                |
+----+---------------------+---------------------+
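No DataStage answer is recorded here, but the intended logic (parse the string to a timestamp, take the next row's value, subtract 10 minutes) can be sketched in PySpark for reference. Everything below is illustrative, not DataStage syntax:
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "01/04/2016 13:45:25"),
     (2, "10/04/2016 01:25:36"),
     (3, "26/10/2017 22:35:13")],
    ["Id", "date_str"])

# Parse the string into a real timestamp first; arithmetic such as
# "minus 10 minutes" only works on timestamps, not on strings.
w = Window.orderBy("Id")  # single global ordering, fine for a sketch
result = (
    df.withColumn("ts", F.to_timestamp("date_str", "dd/MM/yyyy HH:mm:ss"))
      .withColumn("before", F.lead("ts", 1).over(w) - F.expr("INTERVAL 10 MINUTES"))
)
result.show(truncate=False)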

Tableau - Calculated field for difference between date and maximum date in table

I have the following table loaded in Tableau (it has only one column, CreatedOnDate):
+-----------------+
| CreatedOnDate |
+-----------------+
| 1/1/2016 |
| 1/2/2016 |
| 1/3/2016 |
| 1/4/2016 |
| 1/5/2016 |
| 1/6/2016 |
| 1/7/2016 |
| 1/8/2016 |
| 1/9/2016 |
| 1/10/2016 |
| 1/11/2016 |
| 1/12/2016 |
| 1/13/2016 |
| 1/14/2016 |
+-----------------+
I want to be able to find the maximum date in the table, compare it with every date in the table and get the difference in days. For the above table, the maximum date in table is 1/14/2016. Every date is compared to 1/14/2016 to find the difference.
Expected Output
+-----------------+------------+
| CreatedOnDate | Difference |
+-----------------+------------+
| 1/1/2016 | 13 |
| 1/2/2016 | 12 |
| 1/3/2016 | 11 |
| 1/4/2016 | 10 |
| 1/5/2016 | 9 |
| 1/6/2016 | 8 |
| 1/7/2016 | 7 |
| 1/8/2016 | 6 |
| 1/9/2016 | 5 |
| 1/10/2016 | 4 |
| 1/11/2016 | 3 |
| 1/12/2016 | 2 |
| 1/13/2016 | 1 |
| 1/14/2016 | 0 |
+-----------------+------------+
My goal is to create this Difference calculated field, but I am struggling to find a way to do it using DATEDIFF.
Any help would be appreciated!
woodhead92, that approach would work, but it means you have to use table calculations. A much more flexible approach (available since v8) is Level of Detail expressions:
First, define a MAX date for the whole dataset with this calculated field called MaxDate LOD:
{FIXED : MAX(CreatedOnDate) }
This will always calculate the maximum date in the whole table (it also overrides filters; if you need them reflected, make sure you add them to context).
Then you can use pretty much the same calculated field, but with no need for ATTR or table calculations:
DATEDIFF('day', [CreatedOnDate], [MaxDate LOD])
Hope this helps!
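For intuition, here is the same computation as a plain Python sketch (the dates mirror the table above; this is not Tableau syntax):
from datetime import date

created_on = [date(2016, 1, d) for d in range(1, 15)]   # 1/1/2016 .. 1/14/2016

# Fix the maximum date once for the whole dataset, then diff every row
# against it - the same shape as the FIXED LOD plus DATEDIFF above.
max_date = max(created_on)
for d in created_on:
    print(d, (max_date - d).days)   # 2016-01-01 -> 13, ..., 2016-01-14 -> 0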

Error in org-table-sum org-mode?

I am just getting started with Emacs org-mode and I am already getting really confused about a simple column sum (org-table-sum). I start with
| date | sum |
|------+-------|
| | 16.2 |
| | 6.16 |
| | 6.16 |
| | |
When I hit C-c + (org-table-sum) below the second column I get the correct sum 28.52. If I add another line to make it
| date | sum |
|------+-------|
| | 16.2 |
| | 6.16 |
| | 6.16 |
| | 13.11 |
| | |
C-c + gives me 41.629999999999995. ???
If I change the last line from 13.11 to 13.12, C-c + will give me (the correct) 41.64.
WTF?
Any explanation appreciated! Thanks!
Most decimal numbers cannot be represented exactly in binary floating point encoding (either single or double precision).
Test 13.11 here, to see that after conversion to single-precision floating point, the nearest representable number is 13.109999656677246; double precision gets closer, but is still not exact.
This problem is not emacs related, but is a fundamental issue when working with floating point representation in a different base (binary rather than decimal).
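The same effect is easy to reproduce outside Emacs; here is a plain Python sketch (the exact trailing digits can vary with summation order):
from decimal import Decimal

vals = ["16.2", "6.16", "6.16", "13.11"]

# Binary doubles: each literal is stored as the nearest binary fraction,
# so the running sum drifts away from the exact decimal result.
print(sum(float(v) for v in vals))    # e.g. 41.629999999999995, not 41.63

# Decimal arithmetic keeps exact base-10 digits, like calc's vsum below.
print(sum(Decimal(v) for v in vals))  # 41.63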
Using calc's vsum, the result is OK:
| date | sum |
|------+-------|
| | 16.2 |
| | 6.16 |
| | 6.16 |
| | 13.11 |
|------+-------|
| | 41.63 |
#+TBLFM: @6$2=vsum(@I..@II)
This works because calc works with arbitrary precision and will not encode the numbers in a binary floating point format.