Create timeline chart based on annual data - charts

How does one automatically create a continuous chart over time from data that is only available broken out by year, one row per year?
For example, most data comes in the form of the following:
       | j | f | m | a | m | j | j | a | s | o | n | d
year 1 | x | y |   |   |   |   |   |   |   |   |   |
year 2 |   |   | z |   |   |   |   |   |   |   |   |
year 3 |   |   |   |   |   |   |   |   |   |   |   |
However, in order to create a chart that spans multiple years, I need the data transposed and combined as below:
year 1 | j | x
       | f | y
       | m |
       | a |
       | m |
       | j |
       | j |
       | a |
       | s |
       | o |
       | n |
       | d |
year 2 | j |
       | f |
       | m | z
       | a |
       | m |
       | j |
       | j |
       | a |
       | s |
       | o |
       | n |
       | d |
year 3 | j |
       | f |
       | m |
       | a |
       | m |
       | j |
       | j |
       | a |
       | s |
       | o |
       | n |
       | d |
Is there a simple way to do this with pivot tables or something else?

Assuming your data is in Sheet1, put this in Sheet2!A2:
=TRANSPOSE(SPLIT(ArrayFormula(JOIN(" , , , , , , , , , , , ,",FILTER(Sheet1!A2:A,Sheet1!A2:A <> ""))),","))
In Sheet2 B2 put:
=ArrayFormula(transpose(split(rept(join(",",transpose(Sheet1!B1:N1)),countA(Sheet1!A2:A)),",")))
These will expand if years are added.
To get the values themselves, this will work; however, a query will need to be added for each additional year (for example, a fourth year would append query(Sheet1!B5:M5,"select *") to the list):
=transpose({query(Sheet1!B2:M2,"select *"),query(Sheet1!B3:M3,"select *"),query(Sheet1!B4:M4,"select *")})
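As an aside, if you ever need to do the same reshaping outside of Sheets (for example to feed a plotting library), a pandas melt is the equivalent operation. This is only a sketch with hypothetical month columns and placeholder numbers standing in for x, y and z; it is not part of the formulas above.
import pandas as pd

# hypothetical wide layout mirroring the sheet: one row per year, one column per month
wide = pd.DataFrame({
    "year": ["year 1", "year 2", "year 3"],
    "jan": [10.0, None, None],   # x
    "feb": [12.0, None, None],   # y
    "mar": [None, 7.0, None],    # z
})

months = [c for c in wide.columns if c != "year"]
# melt into one (year, month, value) row per cell, then keep calendar order when sorting
long = wide.melt(id_vars="year", value_vars=months, var_name="month", value_name="value")
long["month"] = pd.Categorical(long["month"], categories=months, ordered=True)
long = long.sort_values(["year", "month"]).reset_index(drop=True)
print(long)   # the transposed layout the chart needs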

Related

Pyspark - redistribute percentages

I have a table like the following:
city | center | qty_out | qty_out %
----------------------------------------
A | 1 | 10 | .286
A | 2 | 2 | .057
A | 3 | 23 | .657
B | 1 | 40 | .8
B | 2 | 10 | .2
city-center is unique/the primary key.
If any center within a city has a qty_out % of less than 10% (.10), I want to ignore it and redistribute its % among the other centers of the city. So the result above would become
city | center | qty_out_%
----------------------------------------
A | 1 | .3145
A | 3 | .6855
B | 1 | .8
B | 2 | .2
How can I go about this? I was thinking of a window function partitioned by city, but I can't think of a window function to use for this.
column_list = ["city","center"]
w = Window.partitionBy([col(x) for x in column_list]).orderBy('qty_out_%')
I am not a statistician, so I cannot comment on the equation; however, writing the Spark code as literally as you described, it would look like this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('city')
# the center's share when it is below 10%, null otherwise
redist_cond = F.when(F.col('qty_out %') < 0.1, F.col('qty_out %'))
df = (df.withColumn('redist', F.sum(redist_cond).over(w) / (F.count('*').over(w) - F.count(redist_cond).over(w)))
      .fillna(0, subset=['redist'])       # cities with nothing to redistribute get 0
      .filter(F.col('qty_out %') >= 0.1)  # drop the small centers
      .withColumn('qty_out %', redist_cond.otherwise(F.col('qty_out %') + F.col('redist')))
      .drop('redist'))
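For reference, here is an end-to-end sketch that builds the sample frame from the question and applies the same transformation; the SparkSession setup is an assumption, adjust to your environment.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()   # assumed local session

# the sample rows from the question
df = spark.createDataFrame(
    [("A", 1, 10, 0.286), ("A", 2, 2, 0.057), ("A", 3, 23, 0.657),
     ("B", 1, 40, 0.8), ("B", 2, 10, 0.2)],
    ["city", "center", "qty_out", "qty_out %"])

w = Window.partitionBy('city')
redist_cond = F.when(F.col('qty_out %') < 0.1, F.col('qty_out %'))
result = (df
    .withColumn('redist', F.sum(redist_cond).over(w)
                / (F.count('*').over(w) - F.count(redist_cond).over(w)))
    .fillna(0, subset=['redist'])
    .filter(F.col('qty_out %') >= 0.1)
    .withColumn('qty_out %', redist_cond.otherwise(F.col('qty_out %') + F.col('redist')))
    .drop('redist'))

result.show()
# city A: the dropped 0.057 is split over the two remaining centers (0.0285 each),
#         giving 0.286 + 0.0285 = 0.3145 and 0.657 + 0.0285 = 0.6855
# city B: nothing is below 10%, so 0.8 and 0.2 are unchanged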

What is the order of evaluation of conditions combining AND/OR in PostgreSQL?

Suppose I have a query combining AND and OR conditions without parentheses:
SELECT * FROM tbl1
WHERE a = 1 AND b = 2 OR c = 3;
How does PostgreSQL evaluate these conditions: like (a = 1 AND b = 2) OR c = 3, or like a = 1 AND (b = 2 OR c = 3)? I couldn't find it anywhere in the documentation.
Note: I'm not purposefully writing an ambiguous query like this. I'm building a tool where the user could potentially create a query like that.
Note 2: If it makes any difference, I'm using PostgreSQL 9.6 in one instance and 11 in another.
AND binds more tightly than OR (this is covered by the Operator Precedence table in the SQL Syntax chapter of the documentation), so:
a AND b OR c == (a AND b) OR c
demo: db<>fiddle
 a | b | c | a AND b OR c | (a AND b) OR c | a AND (b OR c)
---+---+---+--------------+----------------+----------------
 f | f | f | f            | f              | f
 f | f | t | t            | t              | f
 f | t | f | f            | f              | f
 f | t | t | t            | t              | f
 t | f | f | f            | f              | f
 t | f | t | t            | t              | t
 t | t | f | t            | t              | t
 t | t | t | t            | t              | t
That, of course, means in your case:
a = 1 AND b = 2 OR c = 3 == (a = 1 AND b = 2) OR c = 3
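If you want to sanity-check the table above without a database, Python's boolean operators happen to have the same relative precedence (and binds tighter than or), so a short script reproduces the three columns:
from itertools import product

# enumerate every combination of a, b, c and evaluate the three groupings
for a, b, c in product([False, True], repeat=3):
    unparenthesized = a and b or c   # parsed as (a and b) or c
    and_grouped = (a and b) or c
    or_grouped = a and (b or c)
    print(a, b, c, unparenthesized, and_grouped, or_grouped)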

Extract time intervals in a scala spark dataframe

I'm trying to extract combined data intervals based on a time series in Scala and Spark.
I have the following data in a dataframe:
Id | State | StartTime | EndTime
---+-------+---------------------+--------------------
1 | R | 2019-01-01T03:00:00 | 2019-01-01T11:30:00
1 | R | 2019-01-01T11:30:00 | 2019-01-01T15:00:00
1 | R | 2019-01-01T15:00:00 | 2019-01-01T22:00:00
1 | W | 2019-01-01T22:00:00 | 2019-01-02T04:30:00
1 | W | 2019-01-02T04:30:00 | 2019-01-02T13:45:00
1 | R | 2019-01-02T13:45:00 | 2019-01-02T18:30:00
1 | R | 2019-01-02T18:30:00 | 2019-01-02T22:45:00
I need to extract the data into time intervals based on the id and state. The resulting data needs to look like:
Id | State | StartTime | EndTime
---+-------+---------------------+--------------------
1 | R | 2019-01-01T03:00:00 | 2019-01-01T22:00:00
1 | W | 2019-01-01T22:00:00 | 2019-01-02T13:45:00
1 | R | 2019-01-02T13:45:00 | 2019-01-02T22:45:00
Note that the first three records have been grouped together because the equipment is contiguously in an R state from 2019-01-01T03:00:00 to 2019-01-01T22:00:00, then it switches to a W state for the next two records from 2019-01-01T22:00:00 to 2019-01-02T13:45:00, and then back to an R state for the last two records.
So it turns out that the answer to this is "Combine rows when the end time of one is the start time of another (Oracle)" translated to Spark.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{min, max, row_number}
import spark.implicits._

// row number per Id and per (Id, State): their difference is constant
// within each run of consecutive rows in the same state
val idSpec = Window.partitionBy('Id).orderBy('StartTime)
val idStateSpec = Window.partitionBy('Id, 'State).orderBy('StartTime)

val df2 = df
  .select('Id, 'State, 'StartTime, 'EndTime,
    row_number().over(idSpec).as("idRowNumber"),
    row_number().over(idStateSpec).as("idStateRowNumber"))
  .groupBy('Id, 'State, 'idRowNumber - 'idStateRowNumber)
  .agg(min('StartTime).as("StartTime"), max('EndTime).as("EndTime"))
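To see why the difference of the two row numbers groups the rows, here are the intermediate values for the sample data:
State | idRowNumber | idStateRowNumber | difference
------+-------------+------------------+-----------
 R    | 1           | 1                | 0
 R    | 2           | 2                | 0
 R    | 3           | 3                | 0
 W    | 4           | 1                | 3
 W    | 5           | 2                | 3
 R    | 6           | 4                | 2
 R    | 7           | 5                | 2
Each run of consecutive rows in the same state shares a constant difference, so grouping by (Id, State, difference) and taking min(StartTime) and max(EndTime) collapses each run into a single interval.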

how to get multiple rows from one row in spark scala [duplicate]

This question already has an answer here:
Flattening Rows in Spark
(1 answer)
Closed 5 years ago.
I have a dataframe in Spark like below, and I want to convert all the columns into separate rows with respect to the first column, id.
+----------------------------------+
| id code1 code2 code3 code4 code5 |
+----------------------------------+
| 1 A B C D E |
| 1 M N O P Q |
| 1 P Q R S T |
| 2 P A C D F |
| 2 S D F R G |
+----------------------------------+
I want the output in the format below:
+-------------+
| id code |
+-------------+
| 1 A |
| 1 B |
| 1 C |
| 1 D |
| 1 E |
| 1 M |
| 1 N |
| 1 O |
| 1 P |
| 1 Q |
| 1 P |
| 1 Q |
| 1 R |
| 1 S |
| 1 T |
| 2 P |
| 2 A |
| 2 C |
| 2 D |
| 2 F |
| 2 S |
| 2 D |
| 2 F |
| 2 R |
| 2 G |
+-------------+
Can anyone please help me with how to get the above output with Spark and Scala?
Using the array, explode and drop functions should give you the desired output, as below:
df.withColumn("code", explode(array("code1", "code2", "code3", "code4", "code5")))
  .drop("code1", "code2", "code3", "code4", "code5")
OR
As suggested by undefined_variable, you can just use select:
df.select($"id", explode(array("code1", "code2", "code3", "code4", "code5")).as("code"))
df.select(col("id"),explode(concat_ws(",",Seq(col(code1),col("code2"),col("code3"),col("code4"),col("code5")))))
Basically idea is first concat all required columns and then explode it

Postgresql: Flagging and Identifying Duplicates

I'm trying to find a way to mark duplicated cases similar to this question.
However, instead of counting occurrences of duplicated values, I'd like to mark them as 0 and 1, for duplicated and unique cases respectively. This is very similar to SPSS's identify duplicate cases function. For example if I have a dataset like:
Name State Gender
John TX M
Katniss DC F
Noah CA M
Katniss CA F
John SD M
Ariel FL F
And if I wanted to flag those with a duplicated name, the output would be something like this:
Name State Gender Dup
John TX M 1
Katniss DC F 1
Noah CA M 1
Katniss CA F 0
John SD M 0
Ariel FL F 1
A bonus would be a query statement that will handle which case to pick when determining the unique case.
SELECT name, state, gender
, NOT EXISTS (SELECT 1 FROM names nx
WHERE nx.name = na.name
AND nx.gender = na.gender
AND nx.ctid < na.ctid) AS Is_not_a_dup
FROM names na
;
Explanation: [NOT] EXISTS(...) results in a boolean value (which can be converted to an integer). The nx.ctid < na.ctid condition is also what decides which copy counts as the unique one: the row with the lowest ctid (typically the one stored first) is flagged as unique, so compare on a different column if you want another rule for picking the surviving case. Casting to integer requires an extra pair of parentheses, though:
SELECT name, state, gender
, (NOT EXISTS (SELECT 1 FROM names nx
WHERE nx.name = na.name
AND nx.gender = na.gender
AND nx.ctid < na.ctid))::integer AS is_not_a_dup
FROM names na
;
Results:
name | state | gender | nodup
---------+-------+--------+-------
John | TX | M | t
Katniss | DC | F | t
Noah | CA | M | t
Katniss | CA | F | f
John | SD | M | f
Ariel | FL | F | t
(6 rows)
name | state | gender | nodup
---------+-------+--------+-------
John | TX | M | 1
Katniss | DC | F | 1
Noah | CA | M | 1
Katniss | CA | F | 0
John | SD | M | 0
Ariel | FL | F | 1
(6 rows)