How to handle NaTs with pandas sqlalchemy and psycopg2 - postgresql

I have a dataframe containing NaTs, shown below, that gives me DataError: (psycopg2.DataError) invalid input syntax for type timestamp: "NaT" when I try to insert the values into a Postgres database.
The dataframe
import datetime
import pandas as pd
import sqlalchemy
from sqlalchemy import MetaData
from sqlalchemy.dialects.postgresql import insert
tst_df = pd.DataFrame({'colA': ['a','b','c','a','z', 'q'],
                       'colB': pd.date_range(end=datetime.datetime.now(), periods=6),
                       'colC': ['a1','b2','c3','a4','z5', 'q6']})
tst_df.loc[5, 'colB'] = pd.NaT
insrt_vals = tst_df.to_dict(orient='records')
engine = sqlalchemy.create_engine("postgresql://user:password@localhost/postgres")
connect = engine.connect()
meta = MetaData(bind=engine)
meta.reflect(bind=engine)
table = meta.tables['tstbl']
insrt_stmnt = insert(table).values(insrt_vals)
do_nothing_stmt = insrt_stmnt.on_conflict_do_nothing(index_elements=['colA','colB'])
The code generating the error
results = engine.execute(do_nothing_stmt)
DataError: (psycopg2.DataError) invalid input syntax for type timestamp: "NaT"
LINE 1: ...6-12-18T09:54:05.046965'::timestamp, 'z5'), ('q', 'NaT'::tim...
One possibility mentioned here is to replace the NaTs with None, but as the previous author said, it seems a bit hackish.
sqlalchemy 1.1.4
pandas 0.19.1
psycopg2 2.6.2 (dt dec pq3 ext lo64)
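The NaT-to-None replacement mentioned above can be sketched in plain Python as follows. This is only a sketch under stated assumptions: the helper names nat_to_none and clean_records are hypothetical, and the check relies on pandas' NaT, like a float NaN, comparing unequal to itself, so the helper itself does not need to import pandas.

```python
def nat_to_none(value):
    """Return None for NaT/NaN-like values, which compare unequal to themselves."""
    try:
        is_missing = bool(value != value)
    except (TypeError, ValueError):
        # Values with ambiguous truthiness (e.g. arrays) are left untouched.
        is_missing = False
    return None if is_missing else value


def clean_records(records):
    """Apply nat_to_none to every value of every record dict."""
    return [{key: nat_to_none(val) for key, val in rec.items()} for rec in records]


# Stand-in for tst_df.to_dict(orient='records'); float('nan') plays the role of NaT.
sample = [{'colA': 'z', 'colB': '2016-12-18', 'colC': 'z5'},
          {'colA': 'q', 'colB': float('nan'), 'colC': 'q6'}]
cleaned = clean_records(sample)
```

With the question's data this would be used as insrt_vals = clean_records(tst_df.to_dict(orient='records')), so that the driver receives None (SQL NULL) instead of a value rendered as 'NaT'.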

Did you try the pandas to_sql method?
It works for me with a MySQL DB (I presume it will also work for PostgreSQL):
In [50]: tst_df
Out[50]:
colA colB colC
0 a 2016-12-14 19:11:36.045455 a1
1 b 2016-12-15 19:11:36.045455 b2
2 c 2016-12-16 19:11:36.045455 c3
3 a 2016-12-17 19:11:36.045455 a4
4 z 2016-12-18 19:11:36.045455 z5
5 q NaT q6
In [51]: import pymysql
...: import sqlalchemy as sa
...:
In [52]: db_connection = 'mysql+pymysql://user:password@mysqlhost/db_name'
...:
In [53]: engine = sa.create_engine(db_connection)
...: conn = engine.connect()
...:
In [54]: tst_df.to_sql('zzz', conn, if_exists='replace', index=False)
On the MySQL side:
mysql> select * from zzz;
+------+---------------------+------+
| colA | colB | colC |
+------+---------------------+------+
| a | 2016-12-14 19:11:36 | a1 |
| b | 2016-12-15 19:11:36 | b2 |
| c | 2016-12-16 19:11:36 | c3 |
| a | 2016-12-17 19:11:36 | a4 |
| z | 2016-12-18 19:11:36 | z5 |
| q | NULL | q6 |
+------+---------------------+------+
6 rows in set (0.00 sec)
PS: Unfortunately, I don't have PostgreSQL available for testing.

PySpark: groupby() count('*') not working as expected, or I'm misunderstanding

I'm trying to get
row of things within categories
row of all things within categories
Below is what I've tried.
# This is PySpark
# df has variables 'id' 'category' 'thing'
# 'category' one : many 'id'
#
# sample data:
# id | category | thing
# alpha | A | X
# alpha | A | X
# alpha | A | Y
# beta | A | X
# beta | A | Z
# beta | A | Z
# gamma | B | X
# gamma | B | Y
# gamma | B | Z
df_count_per_category = df.\
select('category', 'thing').\
groupby('category', 'thing').\
agg(F.count('*').alias('thing_count'))
# Proposition total, to join with df_turnover_segmented
df_total = df.\
select('category').\
groupby('category').\
agg(F.count('*').alias('thing_total'))
df_merge = df.\
join(df_count_per_category,\
(df_count_per_category.thing== df_count_per_category.thing) & \
(df_count_per_category.category== df_count_per_category.category), \
'inner').\
drop(df_count_per_category.thing).\
drop(df_count_per_category.category).\
join(df_total,\
(df.category== df_total.category), \
'inner').\
drop(df_total.category)
df_rate = df_merge.\
withColumn('thing_rate', F.round(F.col('thing_count') / F.col('thing_total'), 3))
I'm expecting thing_count, thing_total, and thing_rate to be the same for same thing since each thing is category exclusive. However, although thing_count is same value across rows, thing_rate is not. Why is that?
This is the R equivalent I would like to achieve:
# This is R
library(tidytable)
df_total = df |>
mutate(.by = c(category, thing),
thing_count = n()) |>
mutate(.by = category,
thing_total = n()) |>
mutate(thing_rate = thing_count / thing_total)
This is the expected result (+/- some columns):
# This is a table
category | thing | thing_count | thing_total | thing_rate
A | X | 3 | 6 | 0,5
A | Y | 1 | 6 | 0,1667
A | Z | 2 | 6 | 0,3333
B | X | 1 | 3 | 0,3333
B | Y | 1 | 3 | 0,3333
B | Z | 1 | 3 | 0,3333
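I think your 2nd join is not doing what you intend.
You are referencing the original df in the 2nd join condition, which creates a wrong association. Instead, you want to join df_total to the result of the first join. Note also that your first join condition compares the columns of df_count_per_category with themselves; the version below compares them against df instead.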
df_merge = df.\
join(df_count_per_category,\
(df.thing == df_count_per_category.thing) & \
(df.category == df_count_per_category.category), \
'inner').\
drop(df_count_per_category.thing).\
drop(df_count_per_category.category)
# The second join references df_merge.category, not df.category.
df_merge = df_merge.join(df_total,\
(df_merge.category == df_total.category), \
'inner').\
drop(df_total.category)
Alternatively, you can achieve your expected dataframe with window functions without multiple joins.
from pyspark.sql import Window
from pyspark.sql import functions as F
df = (df.select('category', 'thing',
F.count('*').over(Window.partitionBy('category', 'thing')).alias('thing_count'),
F.count('*').over(Window.partitionBy('category')).alias('thing_total'))
.withColumn('thing_rate', F.round(F.col('thing_count') / F.col('thing_total'), 3)))
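As a sanity check on what the two window counts compute, the same arithmetic can be reproduced in plain Python on the question's sample data. This is an illustrative sketch rather than Spark code; the variable names are chosen here for the example.

```python
from collections import Counter

# (id, category, thing) rows copied from the sample data in the question.
rows = [
    ('alpha', 'A', 'X'), ('alpha', 'A', 'X'), ('alpha', 'A', 'Y'),
    ('beta', 'A', 'X'), ('beta', 'A', 'Z'), ('beta', 'A', 'Z'),
    ('gamma', 'B', 'X'), ('gamma', 'B', 'Y'), ('gamma', 'B', 'Z'),
]

# Counterpart of count('*') over partitionBy('category', 'thing').
thing_count = Counter((category, thing) for _, category, thing in rows)
# Counterpart of count('*') over partitionBy('category').
thing_total = Counter(category for _, category, _ in rows)

# One result per distinct (category, thing) pair.
rates = {
    (category, thing): round(count / thing_total[category], 3)
    for (category, thing), count in thing_count.items()
}
```

Each (category, thing) pair gets a single rate because both counts depend only on the partition keys, which is why the values should be identical across rows sharing those keys.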

complex logic on pyspark dataframe including previous row existing value as well as previous row value generated on the fly

I have to apply some logic to a Spark dataframe or RDD (preferably a dataframe) that generates two extra columns. The first generated column depends on other columns of the same row, and the second generated column depends on the first generated column of the previous row.
Below is representation of problem statement in tabular format. A and B columns are available in dataframe. C and D columns are to be generated.
A | B | C | D
------------------------------------
1 | 100 | default val | C1-B1
2 | 200 | D1-C1 | C2-B2
3 | 300 | D2-C2 | C3-B3
4 | 400 | D3-C3 | C4-B4
5 | 500 | D4-C4 | C5-B5
Here is the sample data
A | B | C | D
------------------------
1 | 100 | 1000 | 900
2 | 200 | -100 | -300
3 | 300 | -200 | -500
4 | 400 | -300 | -700
5 | 500 | -400 | -900
The only solution I can think of is to coalesce the input dataframe to 1 partition, convert it to an RDD, and then apply a Python function (containing all the calculation logic) through the mapPartitions API.
However, this approach may put the whole load on one executor.
Mathematically, D1-C1 with D1 = C1-B1 becomes C1-B1-C1 = -B1, so each C is just the negated B of the previous row.
In PySpark, the lag window function has a parameter called default, which should simplify your problem. Try this:
import pyspark.sql.functions as F
from pyspark.sql import Window
df = spark.createDataFrame([(1,100),(2,200),(3,300),(4,400),(5,500)],['a','b'])
w=Window.orderBy('a')
df_lag =df.withColumn('c',F.lag((F.col('b')*-1),default=1000).over(w))
df_final = df_lag.withColumn('d',F.col('c')-F.col('b'))
Results:
df_final.show()
+---+---+----+----+
| a| b| c| d|
+---+---+----+----+
| 1|100|1000| 900|
| 2|200|-100|-300|
| 3|300|-200|-500|
| 4|400|-300|-700|
| 5|500|-400|-900|
+---+---+----+----+
If the operation is something more complex than subtraction, the same logic applies: fill column C with your default value, calculate D, then use lag to calculate C and recalculate D.
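The simplification can be checked in plain Python. The sketch below (function names are hypothetical, chosen for illustration) computes the columns once with the row-by-row recurrence from the question and once with the lagged form C_i = -B_(i-1), using the question's sample values.

```python
def generate_columns(b_values, default_c):
    """Row-by-row recurrence: C1 = default, D_i = C_i - B_i, C_i = D_(i-1) - C_(i-1)."""
    rows = []
    prev_c = prev_d = None
    for b in b_values:
        c = default_c if prev_c is None else prev_d - prev_c
        d = c - b
        rows.append((c, d))
        prev_c, prev_d = c, d
    return rows


def generate_columns_closed_form(b_values, default_c):
    """Same result without row-to-row state: C_i = -B_(i-1), as lag(-b) computes."""
    lagged = [default_c] + [-b for b in b_values[:-1]]
    return [(c, c - b) for c, b in zip(lagged, b_values)]


b_column = [100, 200, 300, 400, 500]
sequential = generate_columns(b_column, 1000)
closed_form = generate_columns_closed_form(b_column, 1000)
```

Both lists hold the same (C, D) pairs, which is what allows the lag-based version to avoid a sequential pass over the rows.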
The lag() function may help you with that:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.orderBy("A")
df1 = df1.withColumn("C", F.lit(1000))
df2 = (
df1
.withColumn("D", F.col("C") - F.col("B"))
.withColumn("C",
F.when(F.lag("C").over(w).isNotNull(),
F.lag("D").over(w) - F.lag("C").over(w))
.otherwise(F.col("C")))
.withColumn("D", F.col("C") - F.col("B"))
)

How to convert sequential numerical processing of Cassandra table data to parallel in Spark?

We are doing some mathematical modelling on data from Cassandra table using the spark cassandra connector and the execution is currently sequential to get the output. How do you parallelize this for faster execution?
I'm new to Spark and I tried a few things, but I'm unable to understand how to use tabular data in the map, groupBy, and reduceBy functions. If someone can explain (with some code snippets) how to parallelize tabular data, it would be really helpful.
import org.apache.spark.sql.{Row, SparkSession}
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
class SparkExample(sparkSession: SparkSession, pathToCsv: String) {
private val sparkContext = sparkSession.sparkContext
sparkSession.stop()
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host","127.0.0.1")
.setAppName("cassandra").setMaster("local[*]")
val sc = new SparkContext(conf)
def testExample(): Unit = {
val KNMI_rdd = sc.cassandraTable ("dbks1","knmi_w")
val Table_count = KNMI_rdd.count()
val KNMI_idx = KNMI_rdd.zipWithIndex
val idx_key = KNMI_idx.map{case (k,v) => (v,k)}
var i = 0
var n : Int = Table_count.toInt
println(Table_count)
for ( i <- 1 to n if i < n) {
println(i)
val Row = idx_key.lookup(i)
println(Row)
val firstRow = Row(0)
val yyyy_var = firstRow.get[Int]("yyyy")
val mm_var = firstRow.get[Double]("mm")
val dd_var = firstRow.get[Double]("dd")
val dr_var = firstRow.get[Double]("dr")
val tg_var = firstRow.get[Double]("tg")
val ug_var = firstRow.get[Double]("ug")
val loc_var = firstRow.get[String]("loc")
val pred_factor = (((0.15461 * tg_var) + (0.8954 * ug_var)) / ((0.0000451 * dr_var) + 0.0004487))
println(yyyy_var,mm_var,dd_var,loc_var)
println(pred_factor)
}
}
}
//test data
// loc | yyyy | mm | dd | dr | tg | ug
//-----+------+----+----+-----+-----+----
// AMS | 2019 | 1 | 1 | 35 | 5 | 84
// AMS | 2019 | 1 | 2 | 76 | 34 | 74
// AMS | 2019 | 1 | 3 | 46 | 33 | 85
// AMS | 2019 | 1 | 4 | 35 | 1 | 84
// AMS | 2019 | 1 | 5 | 29 | 0 | 93
// AMS | 2019 | 1 | 6 | 32 | 25 | 89
// AMS | 2019 | 1 | 7 | 42 | 23 | 89
// AMS | 2019 | 1 | 8 | 68 | 75 | 92
// AMS | 2019 | 1 | 9 | 98 | 42 | 86
// AMS | 2019 | 1 | 10 | 92 | 12 | 76
// AMS | 2019 | 1 | 11 | 66 | 0 | 71
// AMS | 2019 | 1 | 12 | 90 | 56 | 85
// AMS | 2019 | 1 | 13 | 83 | 139 | 90
Edit 1:
I tried using the map function and I'm able to do the mathematical computations. How do I add keys, as defined by WeatherId, in front of these values?
case class Weather( loc: String, yyyy: Int, mm: Int, dd: Int,dr: Double, tg: Double, ug: Double)
case class WeatherId(loc: String, yyyy: Int, mm: Int, dd: Int)
val rows = dataset1
.map(line => Weather(
line.getAs[String]("loc"),
line.getAs[Int]("yyyy"),
line.getAs[Int]("mm"),
line.getAs[Int]("dd"),
line.getAs[Double]("dr"),
line.getAs[Double]("tg"),
line.getAs[Double]("ug")
) )
val pred_factor = rows
.map(x => (( ((x.dr * betaz) + (x.tg * betay)) + (x.ug) * betaz)))
Thanks
TL;DR:
Use a DataFrame/Dataset instead of an RDD.
The argument for DataFrames over RDDs is long, but the short of it is that DataFrames and their typed alternative, Datasets, outperform the low-level RDDs.
With the spark-cassandra connector you can configure the input split size, which dictates the partition size in Spark: more partitions means more parallelism.
val lastdf = spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map(
"table" -> "words",
"keyspace" -> "test" ,
"cluster" -> "ClusterOne",
"spark.cassandra.input.split.size_in_mb" -> "48" // option values are strings; smaller size = more partitions
)
).load()
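For the follow-up in Edit 1 (attaching a WeatherId-style key to each computed value), the underlying idea is that every row maps independently to a (key, value) pair, which is what allows the work to be split across partitions. The following is a plain-Python sketch of that shape, not Spark code; the function name keyed_pred_factor is hypothetical, while the formula and rows are taken from the question.

```python
def keyed_pred_factor(row):
    """Map one weather row to ((loc, yyyy, mm, dd), pred_factor)."""
    key = (row['loc'], row['yyyy'], row['mm'], row['dd'])
    # Formula copied from the question's testExample().
    value = ((0.15461 * row['tg']) + (0.8954 * row['ug'])) / \
            ((0.0000451 * row['dr']) + 0.0004487)
    return key, value


# First two rows of the question's test data.
weather_rows = [
    {'loc': 'AMS', 'yyyy': 2019, 'mm': 1, 'dd': 1, 'dr': 35, 'tg': 5, 'ug': 84},
    {'loc': 'AMS', 'yyyy': 2019, 'mm': 1, 'dd': 2, 'dr': 76, 'tg': 34, 'ug': 74},
]

# Each row is processed independently, so this map has no sequential dependency.
keyed = dict(map(keyed_pred_factor, weather_rows))
```

Because no row needs the result of another row, the indexed lookup loop in the question is not required; a per-row function of this shape can be handed to a map over the whole dataset.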

CTE RECURSIVE optimization, how to?

I need to optimize the performance of a common WITH RECURSIVE query... We can limit the depth of the tree and decompose it into many updates, and can also change the representation (use an array)... I have tried some options, but perhaps there is a "classic optimization solution" that I'm not seeing.
All details
There is a t_up table, to be updated, with a composite primary key (pk1,pk2), one attribute attr and an array of references to primary keys... and an unnested representation t_scan, with the references, like this:
pk1 | pk2 | attr | ref_pk1 | ref_pk2
n | 123 | 1 | |
n | 456 | 2 | |
r | 123 | 1 | w | 123
w | 123 | 5 | n | 456
r | 456 | 2 | n | 123
r | 123 | 1 | n | 111
n | 111 | 4 | |
... | ...| ... | ... | ...
There are no loops.
UPDATE t_up SET x = pairs
FROM (
WITH RECURSIVE tree as (
SELECT pk1, pk2, attr, ref_pk1, ref_pk2,
array[array[0,0]]::bigint[] as all_refs
FROM t_scan
UNION ALL
SELECT c.pk1, c.pk2, c.attr, c.ref_pk1, c.ref_pk2
,p.all_refs || array[c.attr,c.pk2]
FROM t_scan c JOIN tree p
ON c.ref_pk1=p.pk1 AND c.ref_pk2=p.pk2 AND c.pk2!=p.pk2
AND array_length(p.all_refs,1)<5 -- 5 or 6 avoiding endless loops
)
SELECT pk1, pk2, array_agg_cat(all_refs) as pairs
FROM (
SELECT distinct pk1, pk2, all_refs
FROM tree
WHERE array_length(all_refs,1)>1 -- ignores initial array[0,0].
) t
GROUP BY 1,2
ORDER BY 1,2
) rec
WHERE rec.pk1=t_up.pk1 AND rec.pk2=t_up.pk2
;
To test:
CREATE TABLE t_scan(
pk1 char,pk2 bigint, attr bigint,
ref_pk1 char, ref_pk2 bigint
);
INSERT INTO t_scan VALUES
('n',123, 1 ,NULL,NULL),
('n',456, 2 ,NULL,NULL),
('r',123, 1 ,'w' ,123),
('w',123, 5 ,'n' ,456),
('r',456, 2 ,'n' ,123),
('r',123, 1 ,'n' ,111),
('n',111, 4 ,NULL,NULL);
Running only rec you will obtain:
pk1 | pk2 | pairs
-----+-----+-----------------
r | 123 | {{0,0},{1,123}}
r | 456 | {{0,0},{2,456}}
w | 123 | {{0,0},{5,123}}
But, unfortunately, to appreciate the "Big Data performance problem", you need to see it in a real database... I am preparing a public GitHub repository that runs with OpenStreetMap Big Data.

Best way to join multiple small tables with a big table in Spark SQL

I'm joining multiple tables using Spark SQL. One of the tables is very big and the others are small (10-20 records). What I really want is to replace values in the biggest table using the other tables, which contain key-value pairs.
i.e.
Bigtable:
| Col 1 | Col 2 | Col 3 | Col 4 | ....
--------------------------------------
| A1 | B1 | C1 | D1 | ....
| A2 | B1 | C2 | D2 | ....
| A1 | B1 | C3 | D2 | ....
| A2 | B2 | C3 | D1 | ....
| A1 | B2 | C2 | D1 | ....
.
.
.
.
.
Table2:
| Col 1 | Col 2
----------------
| A1 | 1a
| A2 | 2a
Table3:
| Col 1 | Col 2
----------------
| B1 | 1b
| B2 | 2b
Table4:
| Col 1 | Col 2
----------------
| C1 | 1c
| C2 | 2c
| C3 | 3c
Table5:
| Col 1 | Col 2
----------------
| D1 | 1d
| D2 | 2d
Expected table is
| Col 1 | Col 2 | Col 3 | Col 4 | ....
--------------------------------------
| 1a | 1b | 1c | 1d | ....
| 2a | 1b | 2c | 2d | ....
| 1a | 1b | 3c | 2d | ....
| 2a | 2b | 3c | 1d | ....
| 1a | 2b | 2c | 1d | ....
.
.
.
.
.
My question is: which is the best way to join the tables? (Assume there are 100 or more small tables.)
1) Collecting the small dataframes, transforming them into maps, broadcasting the maps, and transforming the big dataframe in a single step
bigdf.transform(ds.map(row => (small1.get(row.col1),.....)
2) Broadcasting the tables and joining them with a SQL select.
spark.sql("""
select *
from bigtable
left join small1 using(id1)
left join small2 using(id2)""")
3) Broadcasting the tables and chaining multiple joins
bigtable.join(broadcast(small1), bigtable("col1") === small1("col1")).join...
Thanks in advance
You might:
broadcast all small tables (done automatically by setting spark.sql.autoBroadcastJoinThreshold, which is expressed in bytes, slightly above the size of the small tables)
run a SQL query that joins the big table, such as
val df = spark.sql("""
select *
from bigtable
left join small1 using(id1)
left join small2 using(id2)""")
EDIT:
Choosing between SQL and Spark "dataframe" syntax:
The SQL syntax is more readable and less verbose than the Spark syntax (from a database user's perspective).
From a developer's perspective, the dataframe syntax might be more readable.
The main advantage of the "dataset" syntax is that the compiler can catch some errors. With any string-based syntax, such as SQL or column names (col("mycol")), errors are only spotted at run time.
The best way, as already written in other answers, is to broadcast all small tables. This can also be done in Spark SQL using the BROADCAST hint:
val df = spark.sql("""
select /*+ BROADCAST(t2, t3) */
*
from bigtable t1
left join small1 t2 using(id1)
left join small2 t3 using(id2)
""")
If the data in your small tables is below the threshold size and the physical files for your data are in Parquet format, Spark will broadcast the small tables automatically. If you are reading the data from other sources, such as a SQL database like PostgreSQL, Spark sometimes does not broadcast the table automatically.
If you know that the tables are small and their size is not expected to grow (as with lookup tables), you can explicitly broadcast the dataframe or table and in this way efficiently join a larger table with the small tables.
You can verify that the small table is being broadcast using the explain command on the dataframe, or from the Spark UI.
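Option 1 from the question (collecting the small tables into maps and rewriting the big table in one pass) can be illustrated in plain Python with the question's sample data. This is a sketch of the lookup idea only, not Spark code; replace_row is a hypothetical helper, and keeping the original value when a key has no match is an assumption made here.

```python
# Small lookup tables from the question, collected into plain dicts.
lookups = [
    {'A1': '1a', 'A2': '2a'},              # for Col 1
    {'B1': '1b', 'B2': '2b'},              # for Col 2
    {'C1': '1c', 'C2': '2c', 'C3': '3c'},  # for Col 3
    {'D1': '1d', 'D2': '2d'},              # for Col 4
]

big_table = [
    ('A1', 'B1', 'C1', 'D1'),
    ('A2', 'B1', 'C2', 'D2'),
    ('A1', 'B1', 'C3', 'D2'),
    ('A2', 'B2', 'C3', 'D1'),
    ('A1', 'B2', 'C2', 'D1'),
]


def replace_row(row):
    """Replace each column value via its lookup; keep the value if no match."""
    return tuple(lookup.get(value, value) for lookup, value in zip(lookups, row))


replaced = [replace_row(row) for row in big_table]
```

Note that a left join would yield NULL for an unmatched key, whereas this sketch leaves the original value in place, so the two approaches differ on rows without a match.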