Write a large DataFrame into MongoDB via GridFS

I deal with large DataFrames containing both numbers and text. Obviously I could store each column/row as a separate document in my MongoDB, but I would like to avoid the hassle when loading the data.
I thought about using GridFS, which is abstracted away by the MongoEngine FileField. In the meantime I came up with a solution that works for me:
import pandas as pd
from io import BytesIO

from mongoengine import Document, StringField, FileField


class Frame(Document):
    name = StringField(required=True, max_length=200, unique=True)
    data = FileField()

    @property
    def frame(self):
        str_data = BytesIO(self.data.read()).read().decode()
        try:
            return pd.read_json(str_data, typ="frame")
        except ValueError:
            return pd.read_json(str_data, typ="series")

    def __str__(self):
        return "{name}: \n{frame}".format(name=self.name, frame=self.frame)

    def put(self, frame):
        if self.data:
            self.data.replace(frame.to_json().encode())
        else:
            self.data.new_file()
            self.data.write(frame.to_json().encode())
            self.data.close()
        self.save()


if __name__ == '__main__':
    from pydata.config import connect_production

    connect_production()

    frame = pd.DataFrame(data=[[1, 2, 4], [3, 4, 6]],
                         columns=["A", "B", "C"], index=["X", "Y"])

    # create the document if it does not exist yet, then fetch it
    f = Frame.objects(name="history").update_one(name="history", upsert=True)
    f = Frame.objects(name="history").first()
    f.put(frame=frame)

    print(f)
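For reference, a minimal round-trip check of the setup above (a sketch: it assumes connect_production() has already opened the connection, and a JSON round trip may change dtypes or ordering for other frames):

stored = Frame.objects(name="history").first()
assert stored.frame.equals(frame)  # the stored JSON reproduces the original DataFrame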

Related

How can pyspark remember something in memory like class attributes in mapreduce?

I have a table with 2 columns: image_url, comment. The same image may have many comments, and the data are sorted by image_url in the files.
I need to crawl the image and convert it to binary. This takes a long time, so for the same image I want to do it only once.
In MapReduce, I can remember the last row and its result in memory.
class Mapper:
    def __init__(self):
        self.image_url = None
        self.image_bin = None

    def run(self, image_url, comment):
        if image_url != self.image_url:
            self.image_url = image_url
            self.image_bin = process(image_url)
        return self.image_url, self.image_bin, comment
How can I do it in PySpark? Either RDD or DataFrame is fine.
I would advise you to simply process a grouped version of your DataFrame. Something like this:
from pyspark.sql import functions as F

# Assuming df is your dataframe
df = df.groupBy("image_url").agg(F.collect_list("comment").alias("comments"))
df = df.withColumn("image_bin", process(F.col("image_url")))

df.select(
    "image_url",
    "image_bin",
    F.explode("comments").alias("comment"),
).show()
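Note that process in the snippet above has to be registered as a Spark SQL UDF before withColumn can apply it to a column. A hypothetical sketch of such a wrapper (the download logic is only a placeholder, not part of the original answer):

import urllib.request

from pyspark.sql import functions as F
from pyspark.sql.types import BinaryType

@F.udf(returnType=BinaryType())
def process(image_url):
    # fetch the image and return its raw bytes; each URL appears once after the groupBy
    with urllib.request.urlopen(image_url) as resp:
        return resp.read()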
I found that mapPartitions works. The code looks like this:
def do_cover_partition(partitionData):
    last_url = None
    last_bin = None
    for row in partitionData:
        data = row.asDict()
        print(data)
        # only download when the URL changes from the previous row
        if data['cover_url'] != last_url:
            last_url = data['cover_url']
            last_bin = url2bin(last_url)
        print(data['comment'])
        data['frames'] = last_bin
        yield data

columns = ["id", "cover_url", "comment", "frames"]

df = df.rdd.mapPartitions(do_cover_partition) \
    .map(lambda x: [x[c] for c in columns]) \
    .toDF(columns)
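The per-partition cache above only pays off when rows sharing a cover_url are adjacent within a partition. If the input ordering does not already guarantee this, a repartition-and-sort step before the mapPartitions call (a sketch reusing the column name from the snippet above) keeps duplicates together:

# co-locate identical URLs in one partition and sort so the cache in
# do_cover_partition hits instead of re-downloading
df = df.repartition("cover_url").sortWithinPartitions("cover_url")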

How to combine multiple TIFF's into one large Geotiff in scala?

I am working on a project for finding water depth and extent using a digital ground model (DGM). I have multiple TIFF files covering the area of interest, and I want to combine them into a single TIFF file for quick processing. How can I combine them using my own code below, or any other methodology?
I have tried to concatenate the tiles by getting them as input one by one and then combining them, but it throws a GC error, probably because there is something wrong with the code itself. The code is provided below:
import geotrellis.proj4._
import geotrellis.raster._
import geotrellis.raster.io.geotiff._

object waterdepth {
  val directories = List("data")

  // constants to differentiate which bands to use
  val R_BAND = 0
  val G_BAND = 1
  val NIR_BAND = 2

  // Path to our landsat band geotiffs.
  def bandPath(directory: String) = s"../biggis-landuse/radar_data/${directory}"

  def main(args: Array[String]): Unit = {
    directories.map(directory => generateMultibandGeoTiffFile(directory))
  }

  def generateMultibandGeoTiffFile(directory: String) = {
    val tiffFiles = new java.io.File(bandPath(directory)).listFiles.map(_.toString)

    val singleBandGeoTiffArray = tiffFiles.foldLeft(Array[SinglebandGeoTiff]())((acc, el: String) => {
      acc :+ SinglebandGeoTiff(el)
    })

    val tileArray = ArrayMultibandTile(singleBandGeoTiffArray.map(_.tile))

    println(s"Writing out $directory multispectral tif")
    MultibandGeoTiff(tileArray, singleBandGeoTiffArray(0).extent, singleBandGeoTiffArray(0).crs).write(s"data/$directory.tif")
  }
}
It should be able to create a single TIFF file from all the separate files, but it throws a memory error.
The idea you follow is correct; the OOM probably happens because you're loading lots of TIFFs into memory, so it is not surprising. One solution is to allocate more memory to the JVM. However, you can try this small optimization (which will probably work):
import geotrellis.proj4._
import geotrellis.raster._
import geotrellis.raster.io.geotiff._
import geotrellis.raster.io.geotiff.reader._
import java.io.File

def generateMultibandGeoTiffFile(directory: String) = {
  val tiffs =
    new File(bandPath(directory))
      .listFiles
      .map(_.toString)
      // streaming = true won't force all bytes to load into memory
      // only tiff metadata is fetched here
      .map(GeoTiffReader.readSingleband(_, streaming = true))

  val (extent, crs) = {
    val tiff = tiffs.head
    tiff.extent -> tiff.crs
  }

  // TIFF segments bytes fetch will start only during the write
  MultibandGeoTiff(
    MultibandTile(tiffs.map(_.tile)),
    extent, crs
  ).write(s"data/$directory.tif")
}

mocking snowflake connection

I have a SnowflakeApi class in Python which just works as a wrapper on top of the SnowflakeConnection class. My SnowflakeApi is:
import logging
import os

from snowflake.connector import connect


class SnowflakeApi(object):
    """
    Wrapper to handle snowflake connection
    """

    def __init__(self, account, warehouse, database, user, pwd):
        """
        Handles snowflake connection. Connection must be closed once it is no longer needed
        :param account:
        :param warehouse:
        :param database:
        """
        self.__acct = self._account_url(account)
        self.__wh = warehouse
        self.__db = database
        self.__connection = None
        self.__user = user
        self.__pwd = pwd

    def __create_connection(self):
        try:
            # set the proxy here
            conn = connect(
                account=self.__acct
                , user=self.__user
                , password=self.__pwd
                , warehouse=self.__wh
                , database=self.__db
            )
            return conn
        except:
            raise Exception(
                "Unable to connect to snowflake for user: '{0}', warehouse: '{1}', database: '{2}'".format(
                    self.__user, self.__wh, self.__db))

    def get_connection(self):
        """
        Gets a snowflake connection. If the connection has already been initialised it is returned
        otherwise a new connection is created
        :param credentials_func: method to get database credentials.
        :return:
        """
        try:
            if self.__connection is None:
                self.__connection = self.__create_connection()
            return self.__connection
        except:
            raise Exception("Unable to initalise Snowflake connection")

    def close_connection(self):
        """
        Closes snowflake connection.
        :return:
        """
        self.__connection.close()
The namespace for SnowflakeApi is connection.snowflake_connection.SnowflakeApi (i.e. I have snowflake_connection.py in a folder called connections).
I want to write unit tests for this class using pytest and unittest.mock. The problem is I want to mock connect so that a MagicMock object is returned and no database call is made. So far I have tried:
monkeypatch.setattr(connections.snowflake_connection, "connect", return_value="")
I also changed my original class to just import snowflake. I then created a mock object and used monkeypatch.setattr(snowflake_connection, "snowflake", my_mock_snowflake). That didn't work either.
In short, I have tried a couple of other things but nothing has worked. All I want to do is mock the Snowflake connection so no actual database call is made.
Here is another way, where we mock the Snowflake connector, cursor and fetchall using Python's mock and patch.
import mock
import unittest
from datetime import datetime, timedelta

import feed_daily_report


class TestFeedDailyReport(unittest.TestCase):

    @mock.patch('snowflake.connector.connect')
    def test_compare_partner(self, mock_snowflake_connector):
        tod = datetime.now()
        delta = timedelta(days=8)
        date_8_days_ago = tod - delta
        query_result = [('partner_1', date_8_days_ago)]

        mock_con = mock_snowflake_connector.return_value
        mock_cur = mock_con.cursor.return_value
        mock_cur.fetchall.return_value = query_result

        result = feed_daily_report.main()
        assert result == True
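The module under test is not shown in the answer; a hypothetical feed_daily_report.main that is consistent with the mocked chain above (connect → cursor → fetchall) and with the final assertion might look like this (names and query are assumptions):

# feed_daily_report.py -- hypothetical sketch, names and query assumed
from datetime import datetime, timedelta

import snowflake.connector


def main():
    con = snowflake.connector.connect(user="user", password="pwd", account="acct")
    cur = con.cursor()
    cur.execute("select partner, last_load_date from feed_status")
    rows = cur.fetchall()
    cutoff = datetime.now() - timedelta(days=7)
    # True when at least one partner has not loaded data in the last week,
    # which matches the 8-day-old row set up in the test above
    return any(last_load < cutoff for _, last_load in rows)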
An example using unittest.mock and patching the connection:
from unittest import TestCase
from unittest.mock import patch

from connection.snowflake_connection import SnowflakeApi


class TestSnowFlakeApi(TestCase):

    @patch('connection.snowflake_connection.connect')
    def test_get_connection(self, mock_connect):
        api = SnowflakeApi('the_account',
                           'the_warehouse',
                           'the_database',
                           'the_user',
                           'the_pwd')

        api.get_connection()

        mock_connect.assert_called_once_with(account='account_url',  # Will be the output of self._account_url()
                                             user='the_user',
                                             password='the_pwd',
                                             warehouse='the_warehouse',
                                             database='the_database')
If you're testing other classes that use your SnowFlakeApi wrapper, then you should use the same approach, but patch the SnowFlakeApi itself in those tests.
from package.module import SomeClassThatUsesSnowFlakeApi


class TestSomeClassThatUsesSnowFlakeApi(TestCase):

    @patch('package.module.SnowFlakeApi')
    def test_some_func(self, mock_api):
        instance = SomeClassThatUsesSnowFlakeApi()
        instance.do_something()

        mock_api.assert_called_once_with(...)
        mock_api.return_value.get_connection.assert_called_once_with()
Also note that if you're using Python 2, you will need to pip install mock and then from mock import patch.
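If you need to support both versions from the same file, a small sketch of a compatible import:

try:
    from unittest.mock import patch  # Python 3
except ImportError:
    from mock import patch  # Python 2, after `pip install mock`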
Using stubbing and dependency injection
from ... import SnowflakeApi


def some_func(*args, api=None, **kwargs):
    api = api or SnowflakeApi(...)
    conn = api.get_connection()
    # Do some work
    return result
Your test
class SnowflakeApiStub(SnowflakeApi):
    def __init__(self):
        # bypass the super constructor; because of name mangling the parent's
        # self.__connection is really _SnowflakeApi__connection, so set that
        # attribute explicitly so get_connection() sees the mock
        self._SnowflakeApi__connection = MagicMock()


def test_some_func():
    stub = SnowflakeApiStub()
    mock_connection = stub._SnowflakeApi__connection
    mock_cursor = mock_connection.cursor.return_value
    expect = ...

    actual = some_func(api=stub)

    assert expect == actual
    assert mock_cursor.execute.called
An example using cursor, execute, and fetchone.
import snowflake.connector


class AlongSamePolly:
    def __init__(self, conn):
        self.conn = conn

    def row_count(self):
        cur = self.conn.cursor()
        query = cur.execute('select count(*) from schema.table;')
        return query.fetchone()[0]  # fetchone() returns (12345,)


# I like to dependency inject the snowflake connection object in my classes.
# This lets me use Snowflake Python Connector's built in context manager to
# rollback any errors and automatically close connections. Then you don't have
# try/except/finally blocks everywhere in your code.
#
if __name__ == '__main__':
    with snowflake.connector.connect(user='user', password='password') as con:
        same = AlongSamePolly(con)
        print(same.row_count())
        # => 12345
In the unit tests you mock out the expected method calls (cursor(), execute(), fetchone()) and define the return values to follow the chain of defined mocks.
import unittest
from unittest import mock

from along_same_polly import AlongSamePolly


class TestAlongSamePolly(unittest.TestCase):
    def test_row_count(self):
        with mock.patch('snowflake.connector.connect') as mock_snowflake_conn:
            mock_query = mock.Mock()
            mock_query.fetchone.return_value = (123,)

            mock_cur = mock.Mock()
            mock_cur.execute.return_value = mock_query

            mock_snowflake_conn.cursor.return_value = mock_cur

            same = AlongSamePolly(mock_snowflake_conn)
            self.assertEqual(same.row_count(), 123)


if __name__ == '__main__':
    unittest.main()
The following solution worked for me:
def test_connect(env_var_setup, monkeypatch):
    monkeypatch.setattr(
        snowflake.connector.connection.SnowflakeConnection,
        "connect",
        mocked_sf_connect,
    )
    # calling snowflake connector method
    file_job_map(env_var_setup).connect()


# mocked connection
def mocked_sf_connect(self, **kwargs):
    print("Connection Successfully Established")
    return True

Returning documents as they are added with Scala ReactiveMongo

Reading http://reactivemongo.org/releases/0.11/documentation/tutorial/consume-streams.html, I have this code:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

import play.api.libs.iteratee._
import reactivemongo.bson.BSONDocument
import reactivemongo.api.collections.bson.BSONCollection

def processPerson1(collection: BSONCollection, query: BSONDocument): Future[Unit] = {
  val enumeratorOfPeople: Enumerator[BSONDocument] =
    collection.find(query).cursor[BSONDocument].enumerate()

  val processDocuments: Iteratee[BSONDocument, Unit] =
    Iteratee.foreach { person =>
      val lastName = person.getAs[String]("lastName")
      val prettyBson = BSONDocument.pretty(person)
      println(s"found $lastName: $prettyBson")
    }

  enumeratorOfPeople.run(processDocuments)
}
run is defined as: 'drives the iteratee to consume the enumerator's input, adding an Input.EOF at the end of the input. Returns either a result or an exception.' (from https://www.playframework.com/documentation/2.5.1/api/scala/index.html#play.api.libs.iteratee.Enumerator). Does this mean that if a new document is added to the database, processPerson1 will need to be invoked again, so that enumeratorOfPeople.run(processDocuments) runs and the new document is returned?
I just want to return the documents as they are being added to the DB, without re-invoking the same code. A possible (not very good) solution is to wrap enumeratorOfPeople.run(processDocuments) in a scheduled thread, but the problem of receiving all documents remains; I only want to return documents that have not yet been returned.
I created a capped Mongo collection using this query:
db.createCollection("cappedCollection",
                    { "capped": true, "autoIndexId": true, "size": 4096, "max": 10 })
Then, when querying with find, use the options:
.options(QueryOpts().tailable.awaitData)
(More detail: Play + ReactiveMongo: capped collection and tailable cursor.)
When a new document is added to the collection, the run method will return the latest added document.

Tokenization by Stanford parser is slow?

Question summary: tokenization by the Stanford parser is slow on my local machine, but unreasonably much faster on Spark. Why?
I'm using the Stanford CoreNLP tool to tokenize sentences.
My script in Scala is like this:
import java.util.Properties

import scala.collection.JavaConversions._
import scala.collection.immutable.ListMap
import scala.io.Source

import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

val properties = new Properties()
val coreNLP = new StanfordCoreNLP(properties)

def tokenize(s: String) = {
  properties.setProperty("annotators", "tokenize")
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  annotation.get(classOf[TokensAnnotation]).map(_.value.toString)
}

tokenize("Here is my sentence.")
One call of the tokenize function takes roughly (at least) 0.1 sec.
This is very, very slow, because I have 3 million sentences.
(3M × 0.1 sec = 300K sec ≈ 5,000 minutes, i.e. about 83 hours)
As an alternative approach, I have applied the tokenizer on Spark
(with four worker machines).
import java.util.List
import java.util.Properties

import scala.collection.JavaConversions._

import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

val file = sc.textFile("hdfs:///myfiles")

def tokenize(s: String) = {
  val properties = new Properties()
  properties.setProperty("annotators", "tokenize")
  val coreNLP = new StanfordCoreNLP(properties)
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  annotation.get(classOf[TokensAnnotation]).map(_.toString)
}

def normalizeToken(t: String) = {
  val ts = t.toLowerCase
  val num = "[0-9]+[,0-9]*".r
  ts match {
    case num() => "NUMBER"
    case _ => ts
  }
}

val tokens = file.map(tokenize(_))
val tokenList = tokens.flatMap(_.map(normalizeToken))
val wordCount = tokenList.map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false)
wordCount.saveAsTextFile("wordcount")
This script finishes the tokenization and word count of 3 million sentences in just 5 minutes!
And the results seem reasonable.
Why is this so fast? Or rather, why is the first Scala script so slow?
The problem with your first approach is that you set the annotators property after you initialize the StanfordCoreNLP object. Therefore CoreNLP is initialized with the list of default annotators, which includes the part-of-speech tagger and the parser, and those are orders of magnitude slower than the tokenizer.
To fix this, simply move the line
properties.setProperty("annotators", "tokenize")
before the line
val coreNLP = new StanfordCoreNLP(properties)
This should even be slightly faster than your second approach, as you don't have to reinitialize CoreNLP for each sentence.