pymongo to RMongo, mongodb sort query - mongodb

In pymongo I'm doing a sort query like this:
from pymongo import MongoClient
client = MongoClient()
dbase = client[dbname]
collection = dbase[symbol]
start = time.time()
cursor = collection.find().sort([{'_id', -1}]).limit(6000)
data = list(cursor)
Trying to do the same thing in R now...
library("RMongo")
mongo <- mongoDbConnect("dbname", "localhost", 27017)
query = '{sort({_id: -1})}'
output <- dbGetQuery(mongo, "symbol", query, skip=0, limit=6000)
> output
data frame with 0 columns and 0 rows
what is the proper JSON query string format here?

figured it out with mongolite....
library('mongolite')
con <- mongo("collection_name", url = "mongodb://localhost:27017/dbname")
output <- con$find('{}', sort='{"_id":-1}', limit=6000)

Related

Convert cursor into pyspark dataframe

I have been using this to connect to our organisation's cluster to query the data. is it possible to convert the cursor output directly into pyspark dataframe.
jarFile="/dataflame/nas_nfs/tmp/lib/olympus-jdbc-driver.jar"
url = "olympus:jdbcdriver"
env='uat'
print("Using environment", env.upper())
className = "net.vk.olympus.jdbc.driver.OlympusDriver"
conn = jaydebeapi.connect(className, url,{'username':userid,'password':pwd,'ENV':env,'datasource':'HIVE','EnableLog':'false'},jarFile)
cursor = conn.cursor()
query = "select * from abc.defcs123 limit 5"
cursor.execute(query)
pandas_df = as_pandas(cursor)
print(pandas_df)

Null values in qtablewidget

I develop a desktop application on PYQT5 with integration of posgtresql. I stumbled over the situation that the table does not display the values:
source code:
def createTable(self):
self.tableWidget = QTableWidget()
conn = psycopg2.connect('host=localhost port=5432 dbname=postgres user=postgres password=12345678')
cursor = conn.cursor()
query = cursor.execute("SELECT * FROM wave_params")
result = cursor.fetchall()
for i in result:
print(i)
rows = len(result)
columns = len(result[0])
self.tableWidget.setColumnCount(columns)
self.tableWidget.setRowCount(rows)
index = 0
while query != None:
self.tableWidget.setItem(index,0, QTableWidgetItem(query.result[0]))
# self.tableWidget.setItem(index, 1, QTableWidgetItem(str(query.value(1))))
# self.tableWidget.setItem(index, 2, QTableWidgetItem(str(query.value(2))))
index = index + 1
# table selection change
self.tableWidget.doubleClicked.connect(self.on_click)
#pyqtSlot()
def on_click(self):
print("\n")
for currentQTableWidgetItem in self.tableWidget.selectedItems():
print(currentQTableWidgetItem.row(), currentQTableWidgetItem.column(), currentQTableWidgetItem.text())
I can not understand. what is the problem?
Thanks!
the cursor object has no attribute result.
in your code result is a list of tuples containing the return value of cursor.fetchall() and query is the return value of cursor.execute(). cursor.execute() returns always None
(see documentation). You only need to loop over result, here 2 examples:
def createTable(self):
self.tableWidget = QTableWidget(self)
psycopg2.connect('host=localhost port=5432 dbname=postgres user=postgres password=12345678')
cursor = conn.cursor()
query = cursor.execute("SELECT * FROM ladestelle")
result = cursor.fetchall()
rows = len(result)
columns = len(result[0])
self.tableWidget.setColumnCount(columns)
self.tableWidget.setRowCount(rows)
for i, r in enumerate(result):
self.tableWidget.setItem(i, 0, QTableWidgetItem(r[0]))
self.tableWidget.setItem(i, 1, QTableWidgetItem(str(r[1])))
self.tableWidget.setItem(i, 2, QTableWidgetItem(str(r[2])))
'''
# or simpler
for r in range(rows):
for c in range(columns):
self.tableWidget.setItem(r, c, QTableWidgetItem(str(result[r][c])))
'''
self.tableWidget.doubleClicked.connect(self.on_click)

Spark reading from Postgres JDBC table slow

I am trying to load about 1M rows from a PostgreSQL database into Spark. When using Spark it takes about 10s. However, loading the same query using psycopg2 driver takes 2s. I am using postgresql jdbc driver version 42.0.0
def _loadFromPostGres(name):
url_connect = "jdbc:postgresql:"+dbname
properties = {"user": "postgres", "password": "postgres"}
df = SparkSession.builder.getOrCreate().read.jdbc(url=url_connect, table=name, properties=properties)
return df
df = _loadFromPostGres("""
(SELECT "seriesId", "companyId", "userId", "score"
FROM user_series_game
WHERE "companyId"=655124304077004298) as
user_series_game""")
print measure(lambda : len(df.collect()))
The output is -
--- 10.7214591503 seconds ---
1076131
Using psycopg2 -
import psycopg2
conn = psycopg2.connect(conn_string)
cur = conn.cursor()
def _exec():
cur.execute("""(SELECT "seriesId", "companyId", "userId", "score"
FROM user_series_game
WHERE "companyId"=655124304077004298)""")
return cur.fetchall()
print measure(lambda : len(_exec()))
cur.close()
conn.close()
The output is -
--- 2.27961301804 seconds ---
1076131
The measure function -
def measure(func) :
start_time = time.time()
x = func()
print("--- %s seconds ---" % (time.time() - start_time))
return x
Kindly help me find the cause of this problem.
Edit 1
I did a few more benchmarks. Using Scala and JDBC -
import java.sql._;
import scala.collection.mutable.ArrayBuffer;
def exec() {
val url = ("jdbc:postgresql://prod.caumccqvmegm.ap-southeast-1.rds.amazonaws.com/prod"+
"?tcpKeepAlive=true&prepareThreshold=-1&binaryTransfer=true&defaultRowFetchSize=10000")
val conn = DriverManager.getConnection(url,"postgres","postgres");
val sqlText = """SELECT "seriesId", "companyId", "userId", "score"
FROM user_series_game
WHERE "companyId"=655124304077004298"""
val t0 = System.nanoTime()
val stmt = conn.prepareStatement(sqlText, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
val rs = stmt.executeQuery()
val list = new ArrayBuffer[(Long, Long, Long, Double)]()
while (rs.next()) {
val seriesId = rs.getLong("seriesId")
val companyId = rs.getLong("companyId")
val userId = rs.getLong("userId")
val score = rs.getDouble("score")
list.append((seriesId, companyId, userId, score))
}
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) * 1e-9 + "s")
println(list.size)
rs.close()
stmt.close()
conn.close()
}
exec()
The output was -
Elapsed time: 1.922102285s
1143402
When I did collect() in Spark + Scala -
import org.apache.spark.sql.SparkSession
def exec2() {
val spark = SparkSession.builder().getOrCreate()
val url = ("jdbc:postgresql://prod.caumccqvmegm.ap-southeast-1.rds.amazonaws.com/prod"+
"?tcpKeepAlive=true&prepareThreshold=-1&binaryTransfer=true&defaultRowFetchSize=10000")
val sqlText = """(SELECT "seriesId", "companyId", "userId", "score"
FROM user_series_game
WHERE "companyId"=655124304077004298) as user_series_game"""
val t0 = System.nanoTime()
val df = spark.read
.format("jdbc")
.option("url", url)
.option("dbtable", sqlText)
.option("user", "postgres")
.option("password", "postgres")
.load()
val list = df.collect()
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) * 1e-9 + "s")
print (list.size)
}
exec2()
The output was
Elapsed time: 1.486141076s
1143445
So 4x amount of extra time is spent within Python serialisation. I understand there will be some penalty, but this seems too much.
The reason is really simple and have two simultaneous reasons.
First I will give you a perpective of how psycopg2 works.
This lib psycopg2 works like any other lib to connect to a RDMS. This lib will send the query to the engine of your postgres and it will return the data to you. Straight foward like this.
Conn -> Query -> ReturnData -> FetchData
When you use spark is a little bit different in two ways. Spark is not like a programatic language that run in one single thread. It has a Distributed System to work. Even if you are running in a local machine. See Spark has a basic concept of Driver(Master) and Workers.
The Driver recieve the request to execute the query to the Postgres, the Driver will not request the data for each worker request the information from your Postgres.
If you see the documentation here you will se a note like this:
Don’t create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.
This note means that each worker will be responsible to request the data for your postgres. This is a small overhead of starting this process but nothing really big. But have a overhead here, to send the data to each worker.
Seccond point, your collect in this part of code:
print measure(lambda : len(df.collect()))
The collect function will send a command for all of your workers to send the data to your Driver. To store in the memory of your driver, it is like a Reduce, it creates Shuffle in the middle of the process. Shuffle is the step of the process that the data is send to other workers. In the case of collect each worker will send that to your Driver.
So the steps of Spark in JDBC of your code is:
(Workers)Conn -> (Workers)Query -> (Workers)FetchData -> (Driver)
Request the Data -> (Workers) Shuffle -> (Driver) Collect
Well there in a bunch of other stuffs that happens with the Spark, like the QueryPlan, build the DataFrame and other stuffs.
That is the reason that you have faster response in your simple code of Python than Spark.

passing numeric where parameter condition in postgress using python

I am trying to use postgresql in Python.The query is against a numeric field value in the where condition. The result set is not fetching and giving error ("psycopg2.ProgrammingError: no results to fetch").There are records in the database with agent_id (integer field) > 1.
import psycopg2
# Try to connect
try:
conn=psycopg2.connect("dbname='postgres' host ='localhost')
except:
print "Error connect to the database."
cur = conn.cursor()
agentid = 10000
try:
sql = 'SELECT * from agent where agent_id > %s::integer;'
data = agentid
cur.execute(sql,data)
except:
print "Select error"
rows = cur.fetchall()
print "\nRows: \n"`
for row in rows:``
print " ", row[9]
Perhaps try these things in your code:
conn=psycopg2.connect("dbname=postgres host=localhost user=user_here password=password_here port=port_num_here")
sql = 'SELECT * from agent where agent_id > %s;'
data = (agentid,) # A single element tuple.
then use
cur.execute(sql,data)
Also, I am confused here to what you want to do with this code
for row in rows:``
print " ", row[9]
Do you want to print each row in rows or just the 8th index of rows, from
rows = cur.fetchall()
If you wanted that index, you could
print rows[9]

passing python variable to sql statement psycopg2 pandas

I am trying to replace a piece of sql code with a python variable that I will ask a user to generate using a raw_input.
Below is the code i'm using which works great if I set mypythonvariable manually i.e. inputting 344 into the sql code, but if I set the sql as is to mypythonvariable it doesn't work.
The whole sql query is then converted into a pandas dataframe for further messing about with.
Any help on how to do be appreciated.
UPDATE: I just added the %s code into the statement and i'm now getting the error message '': not all arguments converted during string formatting
'
conn = pg.connect(host = "localhost",
port = 1234,
dbname = "somename",
user = "user",
password = "pswd")
mypythonvariable = raw_input("What is your variable number? ")
sql = """
SELECT
somestuff
FROM
sometable
WHERE
something = %s
"""
df = pd.read_sql_query(sql, con=conn,params=mypythonvariable)
thanks to all that looked.
I found the solution.Looks like the params need to be passed as a list.
conn = pg.connect(host = "localhost",
port = 1234,
dbname = "somename",
user = "user",
password = "pswd")
mypythonvariable = raw_input("What is your variable number? ")
sql = """
SELECT
somestuff
FROM
sometable
WHERE
something = %s
"""
df = pd.read_sql_query(sql, con=conn,params=[mypythonvariable])