How can I load data from mongodb collection into pandas' DataFrame?

How can I load data from mongodb collection into pandas' DataFrame? - mongodb

I am new to pandas (well, to all things "programming"...), but have been encouraged to give it a try.
I have a mongodb database - "test" - with a collection called "tweets".
I access the database in ipython:
import sys
import pymongo
from pymongo import Connection
connection = Connection()
db = connection.test
tweets = db.tweets
the document structure of documents in tweets is as follows:
entities': {u'hashtags': [],
u'symbols': [],
u'urls': [],
u'user_mentions': []},
u'favorite_count': 0,
u'favorited': False,
u'filter_level': u'medium',
u'geo': {u'coordinates': [placeholder coordinate, -placeholder coordinate], u'type': u'Point'},
u'id': 349223842700472320L,
u'id_str': u'349223842700472320',
u'in_reply_to_screen_name': None,
u'in_reply_to_status_id': None,
u'in_reply_to_status_id_str': None,
u'in_reply_to_user_id': None,
u'in_reply_to_user_id_str': None,
u'lang': u'en',
u'place': {u'attributes': {},
u'bounding_box': {u'coordinates': [[[placeholder coordinate, placeholder coordinate],
[-placeholder coordinate, placeholder coordinate],
[-placeholder coordinate, placeholder coordinate],
[-placeholder coordinate, placeholder coordinate]]],
u'type': u'Polygon'},
u'country': u'placeholder country',
u'country_code': u'example',
u'full_name': u'name, xx',
u'id': u'user id',
u'name': u'name',
u'place_type': u'city',
u'url': u'http://api.twitter.com/1/geo/id/1820d77fb3f65055.json'},
u'retweet_count': 0,
u'retweeted': False,
u'source': u'Twitter for iPhone',
u'text': u'example text',
u'truncated': False,
u'user': {u'contributors_enabled': False,
u'created_at': u'Sat Jan 22 13:42:59 +0000 2011',
u'default_profile': False,
u'default_profile_image': False,
u'description': u'example description',
u'favourites_count': 100,
u'follow_request_sent': None,
u'followers_count': 100,
u'following': None,
u'friends_count': 100,
u'geo_enabled': True,
u'id': placeholder_id,
u'id_str': u'placeholder_id',
u'is_translator': False,
u'lang': u'en',
u'listed_count': 0,
u'location': u'example place',
u'name': u'example name',
u'notifications': None,
u'profile_background_color': u'000000',
u'profile_background_image_url': u'http://a0.twimg.com/images/themes/theme19/bg.gif',
u'profile_background_image_url_https': u'https://si0.twimg.com/images/themes/theme19/bg.gif',
u'profile_background_tile': False,
u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/241527685/1363314054',
u'profile_image_url': u'http://a0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg',
u'profile_image_url_https': u'https://si0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg',
u'profile_link_color': u'000000',
u'profile_sidebar_border_color': u'FFFFFF',
u'profile_sidebar_fill_color': u'000000',
u'profile_text_color': u'000000',
u'profile_use_background_image': False,
u'protected': False,
u'screen_name': placeholder screen_name',
u'statuses_count': xxxx,
u'time_zone': u'placeholder time_zone',
u'url': None,
u'utc_offset': -21600,
u'verified': False}}
Now, as far as I understand, pandas' main data structure - a spreadsheet-like table - is called DataFrame. How can I load the data from my "tweets" collection into pandas' DataFrame? And how can I query for a subdocument within the database?

Comprehend the cursor you got from the MongoDB before passing it to DataFrame
import pandas as pd
df = pd.DataFrame(list(tweets.find()))

If you have data in MongoDb like this:
[
{
"name": "Adam",
"age": 27,
"address":{
"number": 4,
"street": "Main Road",
"city": "Oxford"
}
},
{
"name": "Steve",
"age": 32,
"address":{
"number": 78,
"street": "High Street",
"city": "Cambridge"
}
}
]
You can put the data straight into a dataframe like this:
from pandas import DataFrame
df = DataFrame(list(db.collection_name.find({}))
And you will get this output:
df.head()
| | name | age | address |
|----|---------|------|-----------------------------------------------------------|
| 1 | "Steve" | 27 | {"number": 4, "street": "Main Road", "city": "Oxford"} |
| 2 | "Adam" | 32 | {"number": 78, "street": "High St", "city": "Cambridge"} |
However the subdocuments will just appear as JSON inside the subdocument cell. If you want to flatten objects so that subdocument properties are shown as individual cells you can use json_normalize without any parameters.
from pandas.io.json import json_normalize
datapoints = list(db.collection_name.find({})
df = json_normalize(datapoints)
df.head()
This will give the dataframe in this format:
| | name | age | address.number | address.street | address.city |
|----|--------|------|----------------|----------------|--------------|
| 1 | Thomas | 27 | 4 | "Main Road" | "Oxford" |
| 2 | Mary | 32 | 78 | "High St" | "Cambridge" |

You can load your MongoDB data to pandas DataFame using this code. It works for me.
import pymongo
import pandas as pd
from pymongo import Connection
connection = Connection()
db = connection.database_name
input_data = db.collection_name
data = pd.DataFrame(list(input_data.find()))

Use:
df=pd.DataFrame.from_dict(collection)

This is the simplest technique to achieve your aim.
import pymongo
import pandas as pd
from pymongo import Connection
conn = Connection()
db = conn.your_database_name
input_data = db.your_collection_name
pandas_data_frame = pd.DataFrame(list(input_data.find()))
print(pandas_data_frame)

Related

Save DF with JSON string as JSON without escape characters with Apache Spark

I have a dataframe which contains some column and json string:
val df = Seq (
(0, """{"device_id": 0, "device_type": "sensor-ipad", "ip": "68.161.225.1", "cca3": "USA", "cn": "United States", "temp": 25, "signal": 23, "battery_level": 8, "c02_level": 917, "timestamp" :1475600496 }"""),
(1, """{"device_id": 1, "device_type": "sensor-igauge", "ip": "213.161.254.1", "cca3": "NOR", "cn": "Norway", "temp": 30, "signal": 18, "battery_level": 6, "c02_level": 1413, "timestamp" :1475600498 }""")
).toDF("id", "json")
Which I want to save as json - without a nested json string in it but a 'raw' one instead.
When I
df.write.json("path")
It saves my json column as string:
{"id":0,"json":"{\"device_id\": 0, \"device_type\": \"sensor-ipad\", \"ip\": \"68.161.225.1\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 25, \"signal\": 23, \"battery_level\": 8, \"c02_level\": 917, \"timestamp\" :1475600496 }"}
And what I need is:
{"id": 0,"json": {"device_id": 0,"device_type": "sensor-ipad","ip": "68.161.225.1","cca3": "USA","cn": "United States","temp": 25,"signal": 23,"battery_level": 8,"c02_level": 917,"timestamp": 1475600496}}
How can I achieve it? Please not that the structure of json could be different for each row, it can contain additional fields.

You can use from_json function to get the json string data as a new column
// get schema of the json data
// You can also define your own schema
import org.apache.spark.sql.functions._
val json_schema = spark.read.json(df.select("json").as[String]).schema
val resultDf = df.withColumn("json", from_json($"json", json_schema))
Output:
{"id":0,"json":{"battery_level":8,"c02_level":917,"cca3":"USA","cn":"United States","device_id":0,"device_type":"sensor-ipad","ip":"68.161.225.1","signal":23,"temp":25,"timestamp":1475600496}}
{"id":1,"json":{"battery_level":6,"c02_level":1413,"cca3":"NOR","cn":"Norway","device_id":1,"device_type":"sensor-igauge","ip":"213.161.254.1","signal":18,"temp":30,"timestamp":1475600498}}

Problem when querying Raw Data with STH-Comet - Returns empty

I have Orion, Cygnus and STH-Comet(installed and configured in formal mode). Each component is in a container docker. I implemented the infrastructure with docker-compose.yml.
The Cygnus container is configured as follows:
image: fiware/cygnus-ngsi:latest
hostname: cygnus
container_name: cygnus
volumes:
- /home/ubuntu/cygnus/multisink_agent.conf:/opt/fiware-cygnus/docker/cygnus-ngsi/multisink_agent.conf
depends_on:
- mongo
networks:
- default
expose:
- "5050"
- "5080"
ports:
- "5050:5050"
- "5080:5080"
environment:
- CYGNUS_SERVICE_PORT=5050
- CYGNUS_MONITORING_TYPE=http
- CYGNUS_AGENT_NAME=cygnus-ngsi
- CYGNUS_MONGO_SERVICE_PORT=5050
- CYGNUS_MONGO_HOSTS=mongo:27017
- CYGNUS_MONGO_USER=
- CYGNUS_MONGO_PASS=
- CYGNUS_MONGO_ENABLE_ENCODING=false
- CYGNUS_MONGO_ENABLE_GROUPING=false
- CYGNUS_MONGO_ENABLE_NAME_MAPPINGS=false
- CYGNUS_MONGO_DATA_MODEL=dm-by-entity
- CYGNUS_MONGO_ATTR_PERSISTENCE=column
- CYGNUS_MONGO_DB_PREFIX=sth_
- CYGNUS_MONGO_COLLECTION_PREFIX=sth_
- CYGNUS_MONGO_ENABLE_LOWERCASE=false
- CYGNUS_MONGO_BATCH_TIMEOUT=30
- CYGNUS_MONGO_BATCH_TTL=10
- CYGNUS_MONGO_DATA_EXPIRATION=0
- CYGNUS_MONGO_COLLECTIONS_SIZE=0
- CYGNUS_MONGO_MAX_DOCUMENTS=0
- CYGNUS_MONGO_BATCH_SIZE=1
- CYGNUS_LOG_LEVEL=DEBUG
- CYGNUS_SKIP_CONF_GENERATION=false
- CYGNUS_STH_ENABLE_ENCODING=false
- CYGNUS_STH_ENABLE_GROUPING=false
- CYGNUS_STH_ENABLE_NAME_MAPPINGS=false
- CYGNUS_STH_DB_PREFIX=sth_
- CYGNUS_STH_COLLECTION_PREFIX=sth_
- CYGNUS_STH_DATA_MODEL=dm-by-entity
- CYGNUS_STH_ENABLE_LOWERCASE=false
- CYGNUS_STH_BATCH_TIMEOUT=30
- CYGNUS_STH_BATCH_TTL=10
- CYGNUS_STH_DATA_EXPIRATION=0
- CYGNUS_STH_BATCH_SIZE=1
Obs: In the multisink_agent.conf file I changed the service and the servicepath:
cygnus-ngsi.sources.http-source-mongo.handler.default_service = tese
cygnus-ngsi.sources.http-source-mongo.handler.default_service_path = /iot
And the STH-Comet container looks like this:
image: fiware/sth-comet:latest
hostname: sth
container_name: sth
depends_on:
- cygnus
- mongo
networks:
- default
expose:
- "8666"
ports:
- "8666:8666"
environment:
- STH_HOST=0.0.0.0
- STH_PORT=8666
- DB_URI=mongo:27017
- DB_USERNAME=
- DB_PASSWORD=
- LOGOPS_LEVEL=DEBUG
In the STH-Comet config.js file I enabled CORS and I changed the defaultService and the defaultServicePath. The file looks like this:
var config = {};
// STH server configuration
//--------------------------
config.server = {
host: 'localhost',
port: '8666',
// Default value: "testservice".
defaultService: 'tese',
// Default value: "/testservicepath".
defaultServicePath: '/iot',
filterOutEmpty: 'true',
aggregationBy: ['day', 'hour', 'minute'],
temporalDir: 'temp',
maxPageSize: '100'
};
// Cors Configuration
config.cors = {
// The enabled is use to set CORS policy
enabled: 'true',
options: {
origin: ['*'],
headers: [
'Access-Control-Allow-Origin',
'Access-Control-Allow-Headers',
'Access-Control-Request-Headers',
'Origin, Referer, User-Agent'
],
additionalHeaders: ['fiware-servicepath', 'fiware-service'],
credentials: 'true'
}
};
// Database configuration
//------------------------
config.database = {
dataModel: 'collection-per-entity',
user: '',
password: '',
authSource: '',
URI: 'localhost:27017',
replicaSet: '',
prefix: 'sth_',
collectionPrefix: 'sth_',
poolSize: '5',
writeConcern: '1',
shouldStore: 'both',
truncation: {
expireAfterSeconds: '0',
size: '0',
max: '0'
},
ignoreBlankSpaces: 'true',
nameMapping: {
enabled: 'false',
configFile: './name-mapping.json'
},
nameEncoding: 'false'
};
// Logging configuration
//------------------------
config.logging = {
level: 'debug',
format: 'pipe',
proofOfLifeInterval: '60',
processedRequestLogStatisticsInterval: '60'
};
module.exports = config;
I use Cygnus to persist historical data. STH-Comet is used only to query raw and aggregated data.
Cygnus' signature on Orion did this:
"description": "A subscription All Entities",
"subject": {
"entities": [
{
"idPattern": ".*"
}
],
"condition": {
"attrs": []
}
},
"notification": {
"http": {
"url": "http://cygnus:5050/notify"
},
"attrs": [],
"attrsFormat":"legacy"
},
"expires": "2040-01-01T14:00:00.00Z",
"throttling": 5
}
The headers used for fiware-service and fiware-servicepath are:
Fiware-service: tese
Fiware-servicepath: /iot
The entities data are stored in orion-tese. I have the collection: entities
{
"_id" : {
"id" : "Tank1",
"type" : "Tank",
"servicePath" : "/iot"
},
"attrNames" : [
"temperature"
],
"attrs" : {
"temperature" : {
"value" : 0.333,
"type" : "Float",
"mdNames" : [ ],
"creDate" : 1594334464,
"modDate" : 1594337770
}
},
"creDate" : 1594334464,
"modDate" : 1594337771,
"lastCorrelator" : "f86d0d74-c23c-11ea-9c82-0242ac1c0005"
}
The raw and aggregated data are stored in sth_tese.
I have the collections:
sth_/iot_Tank1_Tank.aggr
and
sth_/iot_Tank1_Tank
The sth_/iot_Tank1_Tank raw data is in mongoDB:
{
"_id" : ObjectId("5f079d0369591c06b0fc981a"),
"temperature" : 279,
"recvTime" : ISODate("2020-07-09T22:41:05.670Z")
}
{
"_id" : ObjectId("5f07a9eb69591c06b0fc981b"),
"temperature" : 0.333,
"recvTime" : ISODate("2020-07-09T23:36:11.160Z")
}
When I run: http://localhost:8666/STH/v1/contextEntities/type/Tank/id/Tank1/attributes/temperature?aggrMethod=sum&aggrPeriod=minute
or
http://localhost:8666/STH/v2/entities/Tank1/attrs/temperature?type=Tank&aggrMethod=sum&aggrPeriod=minute
I have the result: "sum": 279 and "sum": 0.333. I can recover ALL the aggregated data, max, min, sum, sum2.
The difficulty is with the STH-Comet when I try to retrieve the raw data, the return code is 200 and the value returns empty.
I've tried with APIs v1 and v2, to no avail.
request with v2:
http://sth:8666/STH/v2/entities/Tank1/attrs/temperature?type=Tank&lastN=10
Return
{
"type": "StructuredValue",
"value": []
}
request with v1:
http://sth:8666/STH/v1/contextEntities/type/Tank/id/Tank1/attributes/temperature?lastN=10
Return
{
"contextResponses": [{
"contextElement": {
"attributes": [{
"name": "temperature",
"values": []
}],
"id": "Tank1",
"isPattern": false,
"type": "Tank"
},
"statusCode": {
"code": "200",
"reasonPhrase": "OK"
}
}]
}
The STH-Comet log shows that it is online and connects to the correct database:
time=2020-07-09T22:39:06.698Z | lvl=INFO | corr=n/a | trans=n/a | op=OPER_STH_DB_CONN_OPEN | from=n/a | srv=n/a | subsrv=n/a | comp=STH | msg=Establishing connection to the database at mongodb://#mongo:27017/sth_tese
time=2020-07-09T22:39:06.879Z | lvl=INFO | corr=n/a | trans=n/a | op=OPER_STH_DB_CONN_OPEN | from=n/a | srv=n/a | subsrv=n/a | comp=STH | msg=Connection successfully established to the database at mongodb://#mongo:27017/sth_tese
time=2020-07-09T22:39:07.218Z | lvl=INFO | corr=n/a | trans=n/a | op=OPER_STH_SERVER_START | from=n/a | srv=n/a | subsrv=n/a | comp=STH | msg=Server started at http://0.0.0.0:8666
The STH-Comet log with the api v2 request:
time=2020-07-09T23:46:47.400Z | lvl=DEBUG | corr=998811d9-fac2-4701-b37c-bb9ae1b45b81 | trans=998811d9-fac2-4701-b37c-bb9ae1b45b81 | op=OPER_STH_GET | from=n/a | srv=tese | subsrv=/iot | comp=STH | msg=GET /STH/v2/entities/Tank1/attrs/temperature?type=Tank&lastN=10
time=2020-07-09T23:46:47.404Z | lvl=DEBUG | corr=998811d9-fac2-4701-b37c-bb9ae1b45b81 | trans=998811d9-fac2-4701-b37c-bb9ae1b45b81 | op=OPER_STH_GET | from=n/a | srv=tese | subsrv=/iot | comp=STH | msg=Getting access to the raw data collection for retrieval...
time=2020-07-09T23:46:47.408Z | lvl=DEBUG | corr=998811d9-fac2-4701-b37c-bb9ae1b45b81 | trans=998811d9-fac2-4701-b37c-bb9ae1b45b81 | op=OPER_STH_GET | from=n/a | srv=tese | subsrv=/iot | comp=STH | msg=The raw data collection for retrieval exists
time=2020-07-09T23:46:47.412Z | lvl=DEBUG | corr=998811d9-fac2-4701-b37c-bb9ae1b45b81 | trans=998811d9-fac2-4701-b37c-bb9ae1b45b81 | op=OPER_STH_GET | from=n/a | srv=tese | subsrv=/iot | comp=STH | msg=No raw data available for the request: /STH/v2/entities/Tank1/attrs/temperature?type=Tank&lastN=10
time=2020-07-09T23:46:47.412Z | lvl=DEBUG | corr=998811d9-fac2-4701-b37c-bb9ae1b45b81 | trans=998811d9-fac2-4701-b37c-bb9ae1b45b81 | op=OPER_STH_GET | from=n/a | srv=tese | subsrv=/iot | comp=STH | msg=Responding with no points
According to the log, it establishes the connection to recover the raw data: msg=Getting access to the raw data collection for retrieval.... Confirms that the raw data exists: msg=The raw data collection for retrieval exists. But, it cannot recover this data and generates the message that the raw data is not available and does not return any points:msg=No raw data available for the request and msg=Responding with no points.
I already read the configuration part in the documentation. I've reinstalled everything, several times. I combed all settings and I can't find anything to justify this problem.
What could it be?
Could someone with expertise in STH-Comet give any guidance?
Thanks!

Sometimes the way in which STH tries to recover information doesn't match to the way in wich Cygnus store it. However, that doesn't to be the case here. The datamodel used by STH is configured with config.database.dataModel and it seems to be correct: collection-per-entity (as you have collections like sth_/iot_Tank1_Tank, which correspondds to a single entity, i.e. the one with id Tank1 and type Tank).
Assuming that the setting in config.js is not being overridden by DATA_MODEL env var (although it would be wise to check that, looking to the env vars actuallly inyected to the docker container running STH, I guess that with docker inspect) the only way I think we can continue debugging is to inspect which actual query does STH on MongoDB to end in No raw data available for the request.
MongoDB has a profiler that allows to record every query done in the DB. Thus the procedure would be as follows:
Avoid (or minimize) any other usage of MongoDB instance, to avoid "noise" in the information recorded by the profiler
Start the profiler in "all queries" mode (i.e. profiling level 2)
Do the query at STH API
Stop the profiler
Check the queries recorded by the profiler as a consequence of the request done in step 3
Explaining the usage of the MongoDB profiler is out of the scope of this answer, but the reference I provided above is a good starting point if you don't know it already.
Once you have information about the queries, please provide feedback as comments to this answers. Thanks!

plotting graph in sapui5 with pandas dataframe

As pandas supports dataframe to json conversion and the dataframe can be converted to a json data as below: (1) and 2) are just for references nothing to do with sapui5,
1) for eg:
import pandas as pd
df = pd.DataFrame([['madrid', 10], ['venice', 20],['milan',40],['las vegas',35]],columns=['city', 'temp'])
df.to_json(orient="records")
gives:
[{"city":"madrid","temp":10},{"city":"venice","temp":20},{"city":"milan","temp":40},{"city":"las vegas","temp":35}]
and
df.to_json(orient="split")
gives:
{"columns":["city","temp"],"index":[0,1,2,3],"data":[["madrid",10],["venice",20],["milan",40],["las vegas",35]]}
As we have json data , this data could be used as input to plot properties.
2)for the same json data I have created an API (running on localhost):
http://127.0.0.1:****/graph
API using in flask:(just for refernce)
from flask import Flask
import pandas as pd
app=Flask(__name__)
#app.route('/graph')
def plot():
df=pd.DataFrame([['madrid', 10], ['venice', 20], ['milan', 40], ['las vegas', 35]],
columns=['city', 'temp'])
jsondata=df.to_json(orient='records')
return jsondata
if __name__=='__main__':
app.run()
postman result:
[
{
"city": "madrid",
"temp": 10
},
{
"city": "venice",
"temp": 20
},
{
"city": "milan",
"temp": 40
},
{
"city": "las vegas",
"temp": 35
}
]
3)How can I make use of this sample api to fetch data and then plot a sample graph for {city vs temp} using sapui5 ??
looking for an example to do so, (or) any help on how to make use of api's in sapui5 ?

PostgresOperator in Airflow getting error while passing parameter

I have a dag which queries the postgress database, And I am using postgresOperator
however when passing the parameter I am getting the below Error.
psycopg2.ProgrammingError: column "132" does not exist
LINE 1: ...d,derived_tstamp FROM atomic.events WHERE event_name = "132"
snapshot of my dag below :
default_args = {
"owner": "airflow",
"depends_on_past": False,
"start_date": airflow.utils.dates.days_ago(1),
"email": ["airflow#airflow.com"],
"email_on_failure": False,
"email_on_retry": False,
"retries": 1,
"retry_delay": timedelta(minutes=1),
}
dag = DAG("PostgresTest", default_args=default_args, schedule_interval='3,33 * * * *',template_searchpath = ['/root/airflow/sql/'])
dailyOperator = PostgresOperator(
task_id='Refresh_DailyScore',
postgres_conn_id='postgress_sophi',
params={"e_name":'"132"'},
sql='atomTest.sql',
dag=dag)
Snapshot of atomTest.sql
SELECT domain_userid,derived_tstamp FROM atomic.events WHERE event_name = {{ params.e_name }}
I am hitting my head the whole day to understand why airflow is considering 132 values as column.
Please suggest.

How can I query my collection of tweets for a given variable (tweets with hashtags) using pymongo?

I have a collection called "tweets" stored in my mongodb database called "test".
I connect to the db in the following manner:
import sys
import pymongo
import Connection from pymongo
connection = Connection()
db = connection.test
tweets = db.tweets
I get one document from my tweets as a list comprehension:
list(tweets.find())[0]
This shows me that the structure of the document is as follows:
{u'_id': ObjectId('...'),
u'contributors': None,
u'coordinates': {u'coordinates': [-placeholder coordinate, placeholder coordinate],
u'type': u'Point'},
u'created_at': u'Mon Jun 24 17:53:47 +0000 2013',
u'entities': {u'hashtags': [],
u'symbols': [],
u'urls': [],
u'user_mentions': []},
u'favorite_count': 0,
u'favorited': False,
u'filter_level': u'medium',
u'geo': {u'coordinates': [40.81019674, -73.99020695], u'type': u'Point'},
u'id': 349223842700472320L,
u'id_str': u'349223842700472320',
u'in_reply_to_screen_name': None,
u'in_reply_to_status_id': None,
u'in_reply_to_status_id_str': None,
u'in_reply_to_user_id': None,
u'in_reply_to_user_id_str': None,
u'lang': u'en',
u'place': {u'attributes': {},
u'bounding_box': {u'coordinates': [[[placeholder coordinate, placeholder coordinate],
[-placeholder coordinate, placeholder coordinate],
[-placeholder coordinate, placeholder coordinate],
[-placeholder coordinate, placeholder coordinate]]],
u'type': u'Polygon'},
u'country': u'placeholder country',
u'country_code': u'example',
u'full_name': u'name, xx',
u'id': u'user id',
u'name': u'name',
u'place_type': u'city',
u'url': u'http://api.twitter.com/1/geo/id/1820d77fb3f65055.json'},
u'retweet_count': 0,
u'retweeted': False,
u'source': u'Twitter for iPhone',
u'text': u'example text',
u'truncated': False,
u'user': {u'contributors_enabled': False,
u'created_at': u'Sat Jan 22 13:42:59 +0000 2011',
u'default_profile': False,
u'default_profile_image': False,
u'description': u'example description',
u'favourites_count': 100,
u'follow_request_sent': None,
u'followers_count': 100,
u'following': None,
u'friends_count': 100,
u'geo_enabled': True,
u'id': placeholder_id,
u'id_str': u'placeholder_id',
u'is_translator': False,
u'lang': u'en',
u'listed_count': 0,
u'location': u'example place',
u'name': u'example name',
u'notifications': None,
u'profile_background_color': u'000000',
u'profile_background_image_url': u'http://a0.twimg.com/images/themes/theme19/bg.gif',
u'profile_background_image_url_https': u'https://si0.twimg.com/images/themes/theme19/bg.gif',
u'profile_background_tile': False,
u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/241527685/1363314054',
u'profile_image_url': u'http://a0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg',
u'profile_image_url_https': u'https://si0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg',
u'profile_link_color': u'000000',
u'profile_sidebar_border_color': u'FFFFFF',
u'profile_sidebar_fill_color': u'000000',
u'profile_text_color': u'000000',
u'profile_use_background_image': False,
u'protected': False,
u'screen_name': placeholder screen_name',
u'statuses_count': xxxx,
u'time_zone': u'placeholder time_zone',
u'url': None,
u'utc_offset': -21600,
u'verified': False}}
I then query for all documents with hashtags in my collection:
list(tweets.find({'entities.hashtags.text': {"$ne":None}}))
So far so good. Now, here is my problem. I would like to sort the documents in my collection by screen_name. I try:
users = tweets.find({'entities.hashtags.text': {"$ne":None}}, {"user.screen_name":1})
for user in users:
print user.get["user.screen_name"]
but get the following error message:
TypeError Traceback (most recent call last)
/Users/home/<ipython-input-98-ea29cbbcfe27> in <module>()
1 for user in users:
----> 2 print user.get["screen_name"]
3
TypeError: 'builtin_function_or_method' object has no attribute '__getitem__'
Any idea what I'm doing wrong, here/any idea how I can fix my code?
Thanks!

You use brackets with the get method where you should use parentheses, so either access the key with user.get('screen_name') or user['screen_name'].

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How can I load data from mongodb collection into pandas' DataFrame? - mongodb

Comprehend the cursor you got from the MongoDB before passing it to DataFrame import pandas as pd df = pd.DataFrame(list(tweets.find()))

You can load your MongoDB data to pandas DataFame using this code. It works for me. import pymongo import pandas as pd from pymongo import Connection connection = Connection() db = connection.database_name input_data = db.collection_name data = pd.DataFrame(list(input_data.find()))

Use: df=pd.DataFrame.from_dict(collection)

This is the simplest technique to achieve your aim. import pymongo import pandas as pd from pymongo import Connection conn = Connection() db = conn.your_database_name input_data = db.your_collection_name pandas_data_frame = pd.DataFrame(list(input_data.find())) print(pandas_data_frame)

Related

Save DF with JSON string as JSON without escape characters with Apache Spark

Problem when querying Raw Data with STH-Comet - Returns empty

plotting graph in sapui5 with pandas dataframe

PostgresOperator in Airflow getting error while passing parameter

How can I query my collection of tweets for a given variable (tweets with hashtags) using pymongo?

Categories

Resources