Convert cursor into pyspark dataframe - pyspark

I have been using the code below to connect to our organisation's cluster and query the data. Is it possible to convert the cursor output directly into a PySpark DataFrame?
import jaydebeapi
# as_pandas is assumed to come from the impyla package (impala.util)
from impala.util import as_pandas

jarFile = "/dataflame/nas_nfs/tmp/lib/olympus-jdbc-driver.jar"
url = "olympus:jdbcdriver"
env = 'uat'
print("Using environment", env.upper())
className = "net.vk.olympus.jdbc.driver.OlympusDriver"
# userid and pwd are assumed to be defined elsewhere
conn = jaydebeapi.connect(className, url,
                          {'username': userid, 'password': pwd, 'ENV': env,
                           'datasource': 'HIVE', 'EnableLog': 'false'},
                          jarFile)
cursor = conn.cursor()
query = "select * from abc.defcs123 limit 5"
cursor.execute(query)
pandas_df = as_pandas(cursor)
print(pandas_df)
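One way to get a Spark DataFrame out of this, sketched below on the assumption that an active SparkSession named spark is available, is to fetch the rows through the cursor and pass them to spark.createDataFrame together with the column names taken from cursor.description:
# Minimal sketch: build a PySpark DataFrame from the JDBC cursor output.
# Assumes `spark` is an existing SparkSession and `cursor` is the cursor above.
cursor.execute(query)  # re-run the query so the cursor has fresh results
columns = [desc[0] for desc in cursor.description]
rows = [tuple(r) for r in cursor.fetchall()]
spark_df = spark.createDataFrame(rows, schema=columns)
spark_df.show()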


Add quotation to each element of a list

I need to use a variable that I've created earlier in Spark to select data from a Teradata table:
%spark
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
val query = "select distinct cod_contrato from xxx.contratos"
val df = sqlContext.sql(query)
val dfv = df.select("cod_contrato")
The variable is a string.
So I would like to query the database using that vector of strings.
If I use:
%spark
val sql = s"(SELECT * FROM xx2.CONTRATOS where cod_contrato in '$dfv') as query"
I get:
(SELECT * FROM xx2.CONTRATOS where cod_contrato in '[cod_contrato: string]') as query
The desired result would be:
SELECT * FROM xx2.CONTRATOS where cod_contrato in ('11111', '11112' )
How can I transform the vector into a list enclosed by parentheses, with quotation marks around each element?
Thanks.
Here is my attempt. Starting from some dataframe,
val test = df.select("id").as[String].collect
> test: Array[String] = Array(6597, 8011, 2597, 5022, 5022, 6852, 6852, 5611, 14838, 14838, 2588, 2588)
so test is now an array. Thus, by using mkString,
val sql = "SELECT * FROM xx2.CONTRATOS where cod_contrato in " + test.mkString("('", "','", "')") + " as query"
> sql: String = SELECT * FROM xx2.CONTRATOS where cod_contrato in ('6597','8011','2597','5022','5022','6852','6852','5611','14838','14838','2588','2588') as query
where the final result is now a string.
Make a temp view of the values you want to filter on and then reference it in the query
%spark
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
val query = "select distinct cod_contrato from xxx.contratos"
sqlContext.sql(query).selectExpr("cast(cod_contrato as string)").createOrReplaceTempView("dfv_table")
val sql = "(SELECT * FROM xx2.CONTRATOS where cod_contrato in (select * from dfv_table)) as query"
This will work when the query is run in Spark SQL, but it will not return a query string. Lamanus's answer should be sufficient if all you want is the query as a string.

pymongo to RMongo, mongodb sort query

In pymongo I'm doing a sort query like this:
import time
from pymongo import MongoClient

client = MongoClient()
dbase = client[dbname]
collection = dbase[symbol]
start = time.time()
# sort expects a list of (key, direction) pairs
cursor = collection.find().sort([('_id', -1)]).limit(6000)
data = list(cursor)
Trying to do the same thing in R now...
library("RMongo")
mongo <- mongoDbConnect("dbname", "localhost", 27017)
query = '{sort({_id: -1})}'
output <- dbGetQuery(mongo, "symbol", query, skip=0, limit=6000)
> output
data frame with 0 columns and 0 rows
what is the proper JSON query string format here?
I figured it out with mongolite...
library('mongolite')
con <- mongo("collection_name", url = "mongodb://localhost:27017/dbname")
output <- con$find('{}', sort='{"_id":-1}', limit=6000)

passing numeric where parameter condition in postgres using python

I am trying to use PostgreSQL from Python. The query filters on a numeric field in the where condition. The result set is not being fetched and I get the error "psycopg2.ProgrammingError: no results to fetch". There are records in the database with agent_id (an integer field) > 1.
import psycopg2

# Try to connect
try:
    conn = psycopg2.connect("dbname='postgres' host='localhost'")
except:
    print "Error connecting to the database."
cur = conn.cursor()
agentid = 10000
try:
    sql = 'SELECT * from agent where agent_id > %s::integer;'
    data = agentid
    cur.execute(sql, data)
except:
    print "Select error"
rows = cur.fetchall()
print "\nRows: \n"
for row in rows:
    print " ", row[9]
Perhaps try these things in your code:
conn=psycopg2.connect("dbname=postgres host=localhost user=user_here password=password_here port=port_num_here")
sql = 'SELECT * from agent where agent_id > %s;'
data = (agentid,) # A single element tuple.
then use
cur.execute(sql,data)
Also, I am confused about what you want to do with this code:
for row in rows:
    print " ", row[9]
Do you want to print each row in rows, or just the element at index 9 of rows, from
rows = cur.fetchall()
If you wanted that index, you could
print rows[9]
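Putting these suggestions together, a minimal sketch (reusing the agent table and the placeholder credentials from the answer above) might look like this:
import psycopg2

# Placeholder connection details from the answer above
conn = psycopg2.connect("dbname=postgres host=localhost user=user_here password=password_here")
cur = conn.cursor()

agentid = 10000
sql = 'SELECT * from agent where agent_id > %s;'
cur.execute(sql, (agentid,))  # parameters passed as a single-element tuple

rows = cur.fetchall()
for row in rows:
    print " ", row[9]  # print the value in the tenth column of each row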

How to execute multi line sql in spark sql

How can I execute lengthy, multi-line Hive queries in Spark SQL? For example, a query like the one below:
val sqlContext = new HiveContext (sc)
val result = sqlContext.sql ("
select ...
from ...
");
Use """ instead, so for example
val results = sqlContext.sql ("""
select ....
from ....
""");
or, if you want to format code, use:
val results = sqlContext.sql ("""
|select ....
|from ....
""".stripMargin);
You can use triple-quotes at the start/end of the SQL code or a backslash at the end of each line.
val results = sqlContext.sql ("""
create table enta.scd_fullfilled_entitlement as
select *
from my_table
""");
results = sqlContext.sql (" \
create table enta.scd_fullfilled_entitlement as \
select * \
from my_table \
")
val query = """(SELECT
a.AcctBranchName,
c.CustomerNum,
c.SourceCustomerId,
a.SourceAccountId,
a.AccountNum,
c.FullName,
c.LastName,
c.BirthDate,
a.Balance,
case when [RollOverStatus] = 'Y' then 'Yes' Else 'No' end as RollOverStatus
FROM
v_Account AS a left join v_Customer AS c
ON c.CustomerID = a.CustomerID AND c.Businessdate = a.Businessdate
WHERE
a.Category = 'Deposit' AND
c.Businessdate= '2018-11-28' AND
isnull(a.Classification,'N/A') IN ('Contractual Account','Non-Term Deposit','Term Deposit')
AND IsActive = 'Yes' ) tmp """
It is worth noting that the length is not the issue, only how the string is written. For this you can use """ as Gaweda suggested, or simply use a string variable, e.g. by building it with a StringBuilder. For example:
val selectElements = Seq("a", "b", "c")
val builder = StringBuilder.newBuilder
builder.append("select ")
builder.append(selectElements.mkString(","))
builder.append(" from my_table")
builder.append(" where d<10")
val results = sqlContext.sql(builder.toString())
In addition to the above ways, you can use the below-mentioned way as well:
val results = sqlContext.sql("select .... " +
" from .... " +
" where .... " +
" group by ....
");
Write your SQL inside triple quotes, like """ sql code """.
df = spark.sql(f""" select * from table1 """)
The triple-quote approach is the same for Scala Spark and PySpark (the f prefix above is Python's f-string syntax, used only on the PySpark side).
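For instance, a multi-line PySpark version (a sketch reusing the my_table name from the earlier examples; the column names are hypothetical) could be:
results = spark.sql("""
    select col_a, col_b
    from my_table
    where col_a > 10
""")
results.show()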

passing python variable to sql statement psycopg2 pandas

I am trying to replace a piece of SQL code with a Python variable that I will ask a user to supply via raw_input.
Below is the code I'm using, which works great if I set mypythonvariable manually, i.e. putting 344 directly into the SQL, but if I leave the SQL as is and pass mypythonvariable it doesn't work.
The whole SQL query is then converted into a pandas dataframe for further messing about with.
Any help would be appreciated.
UPDATE: I just added the %s placeholder into the statement and I'm now getting the error message: not all arguments converted during string formatting
import pandas as pd
import psycopg2 as pg  # pg is assumed to be psycopg2

conn = pg.connect(host="localhost",
                  port=1234,
                  dbname="somename",
                  user="user",
                  password="pswd")
mypythonvariable = raw_input("What is your variable number? ")
sql = """
    SELECT
        somestuff
    FROM
        sometable
    WHERE
        something = %s
"""
df = pd.read_sql_query(sql, con=conn, params=mypythonvariable)
Thanks to all who looked.
I found the solution. It looks like the params need to be passed as a list.
conn = pg.connect(host="localhost",
                  port=1234,
                  dbname="somename",
                  user="user",
                  password="pswd")
mypythonvariable = raw_input("What is your variable number? ")
sql = """
    SELECT
        somestuff
    FROM
        sometable
    WHERE
        something = %s
"""
df = pd.read_sql_query(sql, con=conn, params=[mypythonvariable])
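For what it's worth, pandas also accepts a tuple here, and with more than one %s placeholder the values are simply passed in order. A small sketch with a hypothetical second condition and variable (other_value):
sql = """
    SELECT somestuff
    FROM sometable
    WHERE something = %s AND otherthing = %s
"""
df = pd.read_sql_query(sql, con=conn, params=[mypythonvariable, other_value])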