pg8000 get inserted id into dataframe - pyspark

I'm trying to insert the rows of a dataframe into a Postgres database and write the generated primary keys back into the dataframe.
I'm doing this:
def createConnexionRds():
    host = "..."
    database = "..."
    conn = pg8000.connect(
        user="...",
        host=host,
        database=database,
        password="...",
        ssl_context=True)
    return conn

def insertProcess(r):
    conn = createConnexionRds()
    insertResults = conn.run(r["tmp_query"])
    insertResult = "NOT_INSERTED"
    if len(insertResults) > 0:
        insertResult = insertResults[0][0]
    conn.commit()
    conn.close()
    return insertResult

def insertPerQuery(myDataframe):
    query = sf.lit("insert into tabAAA (colBBB) values ('valueCCC') returning idAAA")
    myDataframe = myDataframe.withColumn("tmp_query", query)
    myDataframe = myDataframe.drop("idAAA")
    rdd = myDataframe.rdd.map(
        lambda x: (*x, insertProcess(x))
    )
    myDataframe = myDataframe.withColumn("idAAA", sf.lit(""))
    myDataframe = sqlContext.createDataFrame(rdd, myDataframe.schema)
    myDataframe = myDataframe.drop("tmp_query")
    return myDataframe
df = insertPerQuery(df)
# df.show(100, False)
The issue is that when I comment out df.show(...) (the last line), the inserts are not executed. And if I launch a second df.show(), the inserts are duplicated.
This is for an AWS Glue job.
Thanks.

This is due to the lazy-evaluation nature of Spark. The code only gets executed on the executors once you call an action, in this case .show(). Each action re-evaluates the transformations, which is why a second .show() runs the inserts a second time; trigger a single action (or cache the resulting dataframe before reusing it) so the inserts run exactly once.

Related

Print out all the data within a TableQuery[Restaurants]

def displayTable(table: TableQuery[Restaurants]): Unit = {
  val tablequery = table.map(_.id)
  val action = tablequery.result
  val result = db.run(action)
  result.foreach(id => id.foreach(new_id => println(new_id)))
  total_points = total_points + 10
}
I have tried to print out all the data to the screen but have gotten nowhere. My question is: why does nothing print out? I am using Scala with a JDBC connection via Slick. If you remove the inner new_id => println(new_id) and just print each id, you get:
def displayTable(table: TableQuery[Restaurants]): Unit = {
  val tablequery = table.map(_.id)
  val action = tablequery.result
  val result = db.run(action)
  result.foreach(id => println(id))
  total_points = total_points + 10
}
This code produces output like the following: "Vector()". Can someone please help me print all the data out? I loaded it in using the following code:
def fillTable(): TableQuery[Restaurants] = {
  println("Table filled.")
  val restaurants = TableQuery[Restaurants]
  val setup = DBIO.seq(
    restaurants.schema.create
  )
  val setupFuture = db.run(setup)
  val bufferedSource = io.Source.fromFile("src/main/scala/Restaurants.csv")
  for (line <- bufferedSource.getLines) {
    val cols = line.split(",").map(_.trim)
    var restaurant = new Restaurant(s"${cols(0)}", s"${cols(1)}", s"${cols(2)}",
      s"${cols(3)}", s"${cols(4)}", s"${cols(5)}", s"${cols(6)}",
      s"${cols(7)}", s"${cols(8)}", s"${cols(9)}")
    restaurants.forceInsert(s"${cols(0)}", s"${cols(1)}", s"${cols(2)}",
      s"${cols(3)}", s"${cols(4)}", s"${cols(5)}", s"${cols(6)}",
      s"${cols(7)}", s"${cols(8)}", s"${cols(9)}")
    total_rows = total_rows + 1
This is my first question so I apologize for the format.
The fact that Vector() is your output in the second version of displayTable is a strong hint that your query is returning an empty result, and therefore has no ids to print out. I haven't run your code myself, but I suspect this is because restaurants.forceInsert only returns an action, and you need to db.run() it to actually execute the insert.
I'm also curious why you create var restaurant = ... but then ignore it and call forceInsert, recreating the tuple from the CSV values again. Why not restaurants.forceInsert(restaurant)?
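A minimal sketch of that suggestion, assuming Slick 3.x and that the table's default projection accepts a Restaurant (db, restaurants, Restaurant and the CSV parsing are taken from your question):

import scala.concurrent.Await
import scala.concurrent.duration._

for (line <- bufferedSource.getLines) {
  val cols = line.split(",").map(_.trim)
  val restaurant = new Restaurant(s"${cols(0)}", s"${cols(1)}", s"${cols(2)}",
    s"${cols(3)}", s"${cols(4)}", s"${cols(5)}", s"${cols(6)}",
    s"${cols(7)}", s"${cols(8)}", s"${cols(9)}")
  // forceInsert only builds a DBIO action; db.run(...) actually executes it.
  // Blocking with Await keeps the sketch simple; in real code you would compose the Futures.
  Await.result(db.run(restaurants.forceInsert(restaurant)), 10.seconds)
  total_rows = total_rows + 1
}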

Can you insert data frame from scala into Teradata stored proc?

I'm trying to take a data frame and use it as an input to a stored proc in Teradata. Here is the code:
def dfToStoredProc(store_id: String) = {
  var connection: Connection = null
  Class.forName(driver)
  connection = DriverManager.getConnection(url, username, password)
  val statement = connection.prepareCall("CALL DB.STORED_PROC(?);")
  statement.setString(1, store_id)
  statement.execute()
}
val dataFrame = df.toDF()
dataFrame.map(m => dfToStoredProc(m.getLong(0).toString))
However, I'm getting an error. Can anyone help?
I've realised my mistake: I forgot to add a collect() statement.
dataFrame.map(m => dfToStoredProc(m.getLong(0).toString)).collect()
An empty dataframe was getting passed to the stored proc, which was raising an error.
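Since dfToStoredProc is called only for its side effect, a slightly cleaner sketch (using the same names as above) is foreach, which is also an action and therefore forces execution without collecting anything back to the driver:

// foreach is an action: it runs the stored-proc call for every row and returns nothing to the driver.
df.toDF().foreach(row => dfToStoredProc(row.getLong(0).toString))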

Scala Tail Recursion java.lang.StackOverflowError

I am iteratively querying a mysql table called txqueue that is growing continuously.
Each successive query only considers rows that were inserted into the txqueue table after the query executed in the previous iteration.
To achieve this, each successive query selects rows from the table where the primary key (seqno field in my example below) exceeds the maximum seqno observed in the previous query.
Any newly inserted rows identified in this way are written into a csv file.
The intention is for this process to run indefinitely.
The tail-recursive function below works OK, but after a while it runs into a java.lang.StackOverflowError. The result of each iterative query contains two to three rows, and results are returned every second or so.
Any ideas on how to avoid the java.lang.StackOverflowError?
Is this actually something that can/should be achieved with streaming?
Many thanks for any suggestions.
Here's the code that works for a while:
object TXQImport {
  val driver = "com.mysql.jdbc.Driver"
  val url = "jdbc:mysql://mysqlserveraddress/mysqldb"
  val username = "username"
  val password = "password"
  var connection: Connection = null

  def txImportLoop(startID: BigDecimal): Unit = {
    try {
      Class.forName(driver)
      connection = DriverManager.getConnection(url, username, password)
      val statement = connection.createStatement()
      val newMaxID = statement.executeQuery("SELECT max(seqno) as maxid from txqueue")
      val maxid = new Iterator[BigDecimal] {
        def hasNext = newMaxID.next()
        def next() = newMaxID.getBigDecimal(1)
      }.toStream.max
      val selectStatement = statement.executeQuery("SELECT seqno,someotherfield " +
        " from txqueue where seqno >= " + startID + " and seqno < " + maxid)
      if (startID != maxid) {
        val ts = System.currentTimeMillis
        val file = new java.io.File("F:\\txqueue " + ts + ".txt")
        val bw = new BufferedWriter(new FileWriter(file))
        // Iterate over the ResultSet
        while (selectStatement.next()) {
          bw.write(selectStatement.getString(1) + "," + selectStatement.getString(2))
          bw.newLine()
        }
        bw.close()
      }
      connection.close()
      txImportLoop(maxid)
    }
    catch {
      case e => e.printStackTrace
    }
  }

  def main(args: Array[String]) {
    txImportLoop(0)
  }
}
Your function is not tail-recursive, because the recursive call sits inside a try block with a catch at the end.
That's why you end up with a stack overflow.
You should always annotate the functions you intend to be tail-recursive with @scala.annotation.tailrec - it will fail compilation when tail recursion is impossible, so that you won't be surprised by it at run time.
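A sketch of one way to restructure it: keep the try/catch around the body only, so the recursive call stays in tail position (queryAndExport is a hypothetical helper standing in for the JDBC and CSV work from the question):

import scala.annotation.tailrec

@tailrec
def txImportLoop(startID: BigDecimal): Unit = {
  val nextID =
    try {
      // run one round of querying and CSV writing, then return the new max seqno
      queryAndExport(startID) // hypothetical helper wrapping the JDBC work above
    } catch {
      case e: Exception =>
        e.printStackTrace()
        startID // on failure, retry from the same position
    }
  // The recursive call is now the method's last expression and outside the try/catch,
  // so the compiler accepts @tailrec and reuses the stack frame.
  txImportLoop(nextID)
}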

using spark to read specific columns data from hbase

I have a table in HBase named "orders"; it has a column family 'o' and columns {id, fname, lname, email}, with id as the row key. I am trying to get the values of only fname and email from HBase using Spark. What I am currently doing is given below:
override def put(params: scala.collection.Map[String, Any]): Boolean = {
  var sparkConfig = new SparkConf().setAppName("Connector")
  var sc: SparkContext = new SparkContext(sparkConfig)
  var hbaseConfig = HBaseConfiguration.create()
  hbaseConfig.set("hbase.zookeeper.quorum", ZookeeperQourum)
  hbaseConfig.set("hbase.zookeeper.property.clientPort", zookeeperPort)
  hbaseConfig.set(TableInputFormat.INPUT_TABLE, schemdto.tableName)
  hbaseConfig.set(TableInputFormat.SCAN_COLUMNS, "o:fname,o:email")
  var hBaseRDD = sc.newAPIHadoopRDD(hbaseConfig, classOf[TableInputFormat],
    classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
    classOf[org.apache.hadoop.hbase.client.Result])
  try {
    hBaseRDD.map(tuple => tuple._2).map(result => result.raw())
      .map(f => KeyValueToString(f)).saveAsTextFile(sink)
    true
  } catch {
    case _: Exception => false
  }
}

def KeyValueToString(keyValues: Array[KeyValue]): String = {
  var it = keyValues.iterator
  var res = new StringBuilder
  while (it.hasNext) {
    res.append(Bytes.toString(it.next.getValue()) + ",")
  }
  res.substring(0, res.length - 1)
}
But nothing is returned, and if I try to fetch only one column, such as
hbaseConfig.set(TableInputFormat.SCAN_COLUMNS, "o:fname");
then it returns all the values of the fname column.
So my question is: how do I get multiple columns from HBase using Spark?
Any help will be appreciated.
The list of columns to scan needs to be space-delimited, according to the documentation.
hbaseConfig.set(TableInputFormat.SCAN_COLUMNS, "o:fname o:email");
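If you'd rather not go through raw() (which hands back every KeyValue), a hedged alternative sketch is to read the two columns explicitly from each Result by family and qualifier (hBaseRDD and sink are taken from the question):

import org.apache.hadoop.hbase.util.Bytes

hBaseRDD.map(_._2).map { result =>
  // getValue(family, qualifier) returns the cell value as a byte array, or null if absent
  val fname = Option(result.getValue(Bytes.toBytes("o"), Bytes.toBytes("fname")))
    .map(b => Bytes.toString(b)).getOrElse("")
  val email = Option(result.getValue(Bytes.toBytes("o"), Bytes.toBytes("email")))
    .map(b => Bytes.toString(b)).getOrElse("")
  s"$fname,$email"
}.saveAsTextFile(sink)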

Slick 2.1: Return query results as a map [duplicate]

I have methods in my Play app that query database tables with over a hundred columns. I can't define a case class for each such query, because it would be ridiculously big and would have to be changed with every alteration of the table in the database.
I'm using this approach, where result of the query looks like this:
Map(columnName1 -> columnVal1, columnName2 -> columnVal2, ...)
Example of the code:
implicit val getListStringResult = GetResult[List[Any]](
  r => (1 to r.numColumns).map(_ => r.nextObject).toList
)

def getSomething(): Map[String, Any] = DB.withSession {
  val columns = MTable.getTables(None, None, None, None).list.filter(_.name.name == "myTable").head.getColumns.list.map(_.column)
  val result = sql"""SELECT * FROM myTable LIMIT 1""".as[List[Any]].firstOption.map(columns zip _ toMap).get
}
This is not a problem when the query only runs against a single database and a single table. But I need to be able to use multiple tables and databases in my query, like this:
def getSomething(): Map[String, Any] = DB.withSession {
  // The line below is no longer valid because of multiple tables/databases
  val columns = MTable.getTables(None, None, None, None).list.filter(_.name.name == "table1").head.getColumns.list.map(_.column)
  val result = sql"""
    SELECT *
    FROM db1.table1
    LEFT JOIN db2.table2 ON db2.table2.col1 = db1.table1.col1
    LIMIT 1
  """.as[List[Any]].firstOption.map(columns zip _ toMap).get
}
The same approach can no longer be used to retrieve column names. This problem doesn't exist with something like PHP's PDO or Java's JdbcTemplate - those retrieve column names without any extra effort.
My question is: how do I achieve this with Slick?
import scala.slick.jdbc.{GetResult, PositionedResult}

object ResultMap extends GetResult[Map[String, Any]] {
  def apply(pr: PositionedResult) = {
    val rs = pr.rs // <- jdbc result set
    val md = rs.getMetaData()
    val res = (1 to pr.numColumns).map { i => md.getColumnName(i) -> rs.getObject(i) }.toMap
    pr.nextRow // <- use Slick's advance method to avoid endless loop
    res
  }
}

val result = sql"select * from ...".as(ResultMap).firstOption
Another variant that produces a map containing only the non-null columns (keys in lowercase):
private implicit val getMap = GetResult[Map[String, Any]](r => {
  val metadata = r.rs.getMetaData
  (1 to r.numColumns).flatMap(i => {
    val columnName = metadata.getColumnName(i).toLowerCase
    val columnValue = r.nextObjectOption
    columnValue.map(columnName -> _)
  }).toMap
})
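With that implicit in scope, usage mirrors the question's own pattern (inside a DB.withSession block, so an implicit session is available):

val row: Option[Map[String, Any]] =
  sql"SELECT * FROM db1.table1 LIMIT 1".as[Map[String, Any]].firstOption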