I have a GeoPandas dataframe gdf_shp with the columns 'name_', 'kwd_ypes', 'geom', and 'fid'. My question is how to set fid as the primary key, and whether there is a geometry type in the sqlalchemy.types package. Here is my code:
import psycopg2
import geopandas as gpd
from sqlalchemy import create_engine, types

shapefile = 'file.shp'
gdf_shp = gpd.read_file(shapefile, encoding='windows-1253')
gdf_shp['fid'] = gdf_shp.index
gdf_shp.rename(columns={'NAME': 'name_',
                        'KWD_YPES': 'kwd_ypes',
                        'geometry': 'geom'}, inplace=True)
gdf_shp.set_geometry('geom', inplace=True)

engine = create_engine(...)
gdf_shp.to_postgis(name=table_name,
                   con=engine,
                   dtype={'name_': types.VARCHAR(),
                          'kwd_ypes': types.INTEGER(),
                          'geom': 'geometry',
                          'fid': types.INTEGER(primary_key=True)},
                   if_exists='replace')
I was also wondering if I could somehow skip the
gdf_shp.set_geometry('geom', inplace=True)
line by setting the column to some sort of geometry type in the dtype argument of to_postgis.
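For reference, here is a minimal sketch of one possible way to handle both points, reusing gdf_shp, engine and the types import from above. It assumes geoalchemy2 is installed (GeoPandas relies on it for to_postgis); the geometry type, SRID and the table name 'my_table' are placeholders, and the primary key is added afterwards with a plain ALTER TABLE, since the dtype mapping cannot express key constraints:
from geoalchemy2 import Geometry
from sqlalchemy import text

gdf_shp.to_postgis(name='my_table',
                   con=engine,
                   dtype={'name_': types.VARCHAR(),
                          'kwd_ypes': types.INTEGER(),
                          'geom': Geometry('MULTIPOLYGON', srid=4326),  # placeholder geometry type and SRID
                          'fid': types.INTEGER()},
                   if_exists='replace')
# dtype cannot declare a primary key, so add it in a separate statement
with engine.begin() as con:
    con.execute(text('ALTER TABLE my_table ADD PRIMARY KEY (fid)'))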
edit:
In postgresql.conf, the log_statement is set to:
#log_statement = 'none' # none, ddl, mod, all
My objective is to COPY a .csv file containing ~300k records into Postgres.
I am running the script below and nothing happens; there is no error or warning, but the csv is still not uploaded.
Any thoughts?
import psycopg2

# Try to connect
try:
    conn = psycopg2.connect(database="<db>", user="<user>", password="<pwd>", host="<host>", port="<port>")
    print("Database Connected....")
except:
    print("Unable to Connect....")

cur = conn.cursor()

try:
    sqlstr = "COPY \"HISTORICALS\".\"HISTORICAL_DAILY_MASTER\" FROM STDIN DELIMITER ',' CSV"
    with open('/Users/kevin/Dropbox/Stonks/HISTORICALS/dump.csv') as f:
        cur.copy_expert(sqlstr, f)
    conn.commit()
    print("COPY pass")
except:
    print("Unable to COPY...")

# Close communication with the database
cur.close()
conn.close()
This is what my .csv looks like
Thanks!
Kevin
I suggest you first load your csv into a dataframe with pandas:
import io
import pandas as pd
import psycopg2

conn = psycopg2.connect(database="<db>", user="<user>", password="<pwd>", host="<host>", port="<port>")
cur = conn.cursor()
df = pd.read_csv('data.csv')
buf = io.StringIO()                        # copy_from needs a file-like object, not a DataFrame
df.to_csv(buf, index=False, header=False)
buf.seek(0)
cur.copy_from(buf, 'your_table', sep=',', null='', columns=tuple(df.columns))  # 'your_table' = target table name
conn.commit()
For the columns=(df.columns) part, I don't remember whether copy_from wants a tuple or a list, but it should work with a conversion. You should also read Pandas dataframe to PostgreSQL table using psycopg2 without SQLAlchemy?, which could help you.
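One more debugging note on the original script: the bare except around the COPY hides whatever psycopg2 actually raises, so the failure stays invisible. Catching and printing the exception should reveal the real cause; a sketch of just that part, reusing sqlstr, cur and conn from the script above:
try:
    with open('/Users/kevin/Dropbox/Stonks/HISTORICALS/dump.csv') as f:
        cur.copy_expert(sqlstr, f)
    conn.commit()
    print("COPY pass")
except psycopg2.Error as e:
    print("Unable to COPY...", e)   # surface the actual database error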
I have a csv file that has the following:
id_1     id_2       date
FD345    MER3345    06/12/2020
I want to connect id_1 -> id_2, and the edge between them should be the date (see the picture below): id_1 has a direct edge to id_2, and that edge should carry the date.
So what I did is something like this:
import networkx as nx
import pandas as pd
df = pd.read_csv('data.csv')
G = nx.from_pandas_edgelist(df, source = "id_1", target = "id_2", edge_attr='date', create_using=nx.DiGraph())
But this way it does not seem to connect node_1 and node_2 by the date; it only stores the date as an edge attribute!
Or maybe I am not understanding it correctly, because when I print G.edges() the output looks like this:
('UCU6lC', 'vOGN5A'), ........
It connects the nodes, but I am not sure whether they are connected with the date or not!
Thank you for clearing this up for me.
You need to use draw_networkx_edge_labels() to draw edge labels.
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
df = pd.DataFrame({'id_1': ['FD345'],
                   'id_2': ['MER3345'],
                   'date': ['06/12/2020']})
G = nx.from_pandas_edgelist(df, source="id_1", target="id_2", create_using=nx.DiGraph())

pos = nx.spring_layout(G)  # compute the layout once and reuse it, so the labels land on the drawn edges
nx.draw_networkx(G, pos)
nx.draw_networkx_edge_labels(G, pos, edge_labels=dict(zip(G.edges, df['date'].tolist())),
                             verticalalignment='center_baseline')
plt.show()
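As for the doubt in the question about whether the date is actually attached to the edge: passing edge_attr='date' to from_pandas_edgelist (as in the question's own call) does store it as an edge attribute, which can be checked directly, for example:
G2 = nx.from_pandas_edgelist(df, source="id_1", target="id_2", edge_attr='date', create_using=nx.DiGraph())
print(list(G2.edges(data=True)))
# [('FD345', 'MER3345', {'date': '06/12/2020'})]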
I have written a SQL query which has a subquery in it. It is a valid MySQL query, but it does not run on PySpark.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql.functions import *
sc = spark.sparkContext
sqlcontext = HiveContext(sc)
select location, postal, max(spend), max(revenue)
from (select a.*,
             (select sum(r.revenue)
              from revenue r
              where r.user = a.user and
                    r.dte >= a.dt - interval 10 minute and
                    r.dte <= a.dte + interval 10 minute
             ) as revenue
      from auction a
      where a.event in ('Mid', 'End', 'Show') and
            a.cat_id in (3) and
            a.cat = 'B'
     ) a
group by location, postal;
The error I am getting every time is:
AnalysisException: u"Correlated column is not allowed in a non-equality predicate:\nAggregate [sum(cast(revenue#17 as double)) AS sum(CAST(revenue AS DOUBLE))#498]\n+- Filter (((user#2 = outer(user#85)) && (dt#0 >= cast(cast(outer(dt#67) - interval 10 minutes as timestamp) as string))) && ((dt#0 <= cast(cast(outer(dt#67) + interval 10 minutes as timestamp) as string))
Any insights on this will be helpful.
A correlated subquery using SQL syntax is not an option in PySpark, so in this case I ran the queries separately, with some tweaks to the SQL, and left-joined them using df.join to get the desired output through PySpark. This is how the issue was resolved.
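A rough sketch of that split-and-join idea is below. The table and column names follow the question, but the row identifier, join condition and aggregation details are assumptions rather than the exact code that was used, and spark.table() assumes auction and revenue are registered in the metastore:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

auction = (spark.table("auction")
           .filter(F.col("event").isin("Mid", "End", "Show") &
                   F.col("cat_id").isin(3) &
                   (F.col("cat") == "B"))
           .withColumn("row_id", F.monotonically_increasing_id()))  # tag each auction row
revenue = spark.table("revenue")

# Express the correlated predicate as a plain (non-equi) join condition instead of a subquery
cond = F.expr("r.user = a.user "
              "AND r.dte >= a.dte - interval 10 minutes "
              "AND r.dte <= a.dte + interval 10 minutes")
joined = auction.alias("a").join(revenue.alias("r"), cond, "left")

# Per-auction-row revenue first, then the outer aggregation from the original query
per_row = (joined.groupBy("a.row_id", "a.location", "a.postal", "a.spend")
                 .agg(F.sum("r.revenue").alias("revenue")))
result = (per_row.groupBy("location", "postal")
                 .agg(F.max("spend").alias("max_spend"),
                      F.max("revenue").alias("max_revenue")))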
I am working with the script below.
If I change the script so that I avoid the bytea datatype, I can easily copy data from my postgres table into a Python variable.
But if the data is in a bytea postgres column, I encounter a strange <memory ...> object which confuses me.
Here is the script, which I run against Anaconda Python 3.5.2:
# bytea.py
import sqlalchemy
# I should create a conn
db_s = 'postgres://dan:dan@127.0.0.1/dan'
conn = sqlalchemy.create_engine(db_s).connect()
sql_s = "drop table if exists dropme"
conn.execute(sql_s)
sql_s = "create table dropme(c1 bytea)"
conn.execute(sql_s)
sql_s = "insert into dropme(c1)values( cast('hello' AS bytea) );"
conn.execute(sql_s)
sql_s = "select c1 from dropme limit 1"
result = conn.execute(sql_s)
print(result)
# <sqlalchemy.engine.result.ResultProxy object at 0x7fcbccdade80>
for row in result:
    print(row['c1'])
    # <memory at 0x7f4c125a6c48>
How do I get at the data that is inside <memory at 0x7f4c125a6c48>?
You can convert it using Python's bytes():
for row in result:
    print(bytes(row['c1']))
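The <memory at 0x...> object is a memoryview, which is how psycopg2 returns bytea values under Python 3; bytes() copies it into an ordinary bytes object, so with the row inserted above this prints b'hello'. The memoryview's own method works as well, e.g.:
result = conn.execute("select c1 from dropme limit 1")
for row in result:
    print(row['c1'].tobytes())   # b'hello'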
I just started using PostgreSQL 9.5 and have run into my first problem with a jsonb column. I have been trying to find an answer to this for a while but failing badly. Can someone help?
I have a JSON array in Python containing JSON objects like this:
[{"name":"foo", "age":"18"}, {"name":"bar", "age":"18"}]
I'm trying to insert this into a jsonb column like this:
COPY person(person_jsonb) FROM '/path/to/my/json/file.json';
But only 1 row gets inserted. I hope to have each json object in the array as a new row like this:
1. {"name":"foo", "age":"18"}
2. {"name":"bar", "age":"18"}
I also tried:
cur.execute("INSERT INTO person(person_jsonb) VALUES (%s)",
            (json.dumps(data['person']),))
Still only one row gets inserted. Can someone please help?
EDIT: Added python code as requested
import psycopg2, sys, json

con = None
orders_file_path = '/path/to/my/json/person.json'

try:
    with open(orders_file_path) as data_file:
        data = json.load(data_file)
    con = psycopg2.connect(...)
    cur = con.cursor()
    person = data['person']
    cur.execute("""
        INSERT INTO orders(orders_jsonb)
        VALUES (%s)
        """, (json.dumps(person), ))
    con.commit()
except psycopg2.DatabaseError as e:
    if con:
        con.rollback()
finally:
    if con:
        con.close()
person.json file:
{"person":[{"name":"foo", "age":"18"}, {"name":"bar", "age":"18"}]}
Assuming the simplest schema:
CREATE TABLE test(data jsonb);
Option 1: parse the JSON in Python
Each row needs to be inserted into PostgreSQL separately. You could parse the JSON on the Python side, split the top-level array, and then use cursor.executemany to execute the INSERT once per JSON object:
import json
import psycopg2

con = psycopg2.connect('...')
data = json.loads('[{"name":"foo", "age":"18"}, {"name":"bar", "age":"18"}]')
with con.cursor() as cur:
    cur.executemany('INSERT INTO test(data) VALUES(%s)', [(json.dumps(d),) for d in data])
con.commit()
con.close()
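If the array is large, psycopg2.extras.execute_values is usually faster than executemany, since it batches the rows into fewer INSERT statements; a possible variant of the same insert, assuming the same con and data as above:
from psycopg2.extras import execute_values

with con.cursor() as cur:
    execute_values(cur, 'INSERT INTO test(data) VALUES %s',
                   [(json.dumps(d),) for d in data])
con.commit()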
Option 2: parse the JSON in PostgreSQL
Another option is to push the JSON processing to the PostgreSQL side using json_array_elements:
import psycopg2

con = psycopg2.connect('...')
data = '[{"name":"foo", "age":"18"}, {"name":"bar", "age":"18"}]'
with con.cursor() as cur:
    cur.execute('INSERT INTO test(data) SELECT * FROM json_array_elements(%s)', (data,))
con.commit()
con.close()
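Since the target column is jsonb, the jsonb variant of the function could be used as well, with an explicit cast on the parameter; a minimal sketch against the same table and data:
with con.cursor() as cur:
    cur.execute('INSERT INTO test(data) SELECT * FROM jsonb_array_elements(%s::jsonb)', (data,))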