Extracting data from JSON files - mongodb

I have three JSON files: friends, followers, and tweets.
friends: contains the information about friends and their tweets
followers: contains information about followers and their tweets
tweets: contains all tweets
I want to extract the following info and store it in a MongoDB collection named "friends":
id_str,
name,
description,
favorites_count,
followers_count,
friends_count,
language,
location,
screen_name,
url,
utc_offset
The tricky part for me is the requirement that "each user (friend or follower) must contain its tweets in a new field tweet".
Any suggestions on how to achieve that?
Here is what I am doing at the moment:
JsonSlurper slurper = new JsonSlurper()
def friends = slurper.parseText(new File('./friends.json').text)
def followers = slurper.parseText(new File('./followers.json').text)
def tweets = slurper.parseText(new File('./tweets.json').text)

friends.users.forEach { fr ->
    def frnds = mongo.friends << [
        [
            id_str: fr.id_str,
            name: fr.name,
            description: fr.description,
            favorites_count: fr.favourite_count,
            followers_count: fr.followers_count,
            friends_count: fr.friends_count,
            language: fr.language,
            location: fr.location,
            screen_name: fr.screen_name,
            url: fr.url,
            utc_offset: fr.utc_offset
        ]
    ]
}
Error: Exception in thread "main" groovy.lang.MissingPropertyException: No such property: friends for class

You can use the Mongoose populate method to display/store references to your user objects.
For example:
followers: [{
    type: Schema.Types.ObjectId,
    ref: 'follower'
}]
You can store references by user id in an array: only the ids are stored in the followers array, and you can later populate all of those ids into full objects, so try using ref in your Mongoose model.
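For illustration, a rough sketch of the populate call (the User model name and the callback style are assumptions, not from the original post):
// Replace the stored ObjectIds in 'followers' with full documents.
User.findById(userId)
    .populate('followers')
    .exec(function (err, user) {
        if (err) return console.error(err);
        console.log(user.followers); // full follower objects, not just ids
    });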
This might look a little confusing, so consider reading up on the Mongoose populate method, and also take a look at a video tutorial on the topic.
Hope it helped!
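As for the original Groovy/MongoDB code, here is a rough sketch of the embedding step the question asks about. It assumes tweets.json parses to a list of tweet objects, each carrying its author's id under user.id_str, and that mongo is an initialized GMongo handle; none of this is confirmed by the post:
import groovy.json.JsonSlurper

JsonSlurper slurper = new JsonSlurper()
def friends = slurper.parseText(new File('./friends.json').text)
def tweets = slurper.parseText(new File('./tweets.json').text)

// Group all tweets by their author's id (assumed field: user.id_str).
def tweetsByUser = tweets.groupBy { it.user.id_str }

friends.users.each { fr ->
    mongo.friends << [
        id_str     : fr.id_str,
        name       : fr.name,
        screen_name: fr.screen_name,
        // ...remaining profile fields as in the snippet above...
        // New field holding all of this user's tweets (empty list if none).
        tweet      : tweetsByUser[fr.id_str] ?: []
    ]
}
The same tweetsByUser lookup could be reused when inserting the followers.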

Related

How to persist a document in json format using elasticsearch-dsl

I am trying to update an existing elasticsearch data pipeline and would like to use elasticsearch-dsl more fully. In the current process we create a document as a json object and then use requests to PUT the object to the relevant elasticsearch index.
I would now like to use the elasticsearch-dsl save method but am left struggling to understand how I might do that when my object or document is constructed as json.
Current Process:
# import_script.py
index = 'objects'
doc = {"title": "A title", "Description": "Description", "uniqueID": "1234"}
doc_id = doc["uniqueID"]
elastic_url = 'http://elastic:changeme@localhost:9200/' + index + '/_doc/' + doc_id
api = ObjectsHandler()
api.put(elastic_url, doc)
# objects_handler.py
import requests

class ObjectsHandler():
    def put(self, url, object):
        result = requests.put(url, json=object)
        if result.status_code != requests.codes.ok:
            print(result.text)
            result.raise_for_status()
Rather than using this PUT method, I would like to tap into the Document.save functionality available in the DSL but I can't translate the examples in the api documentation for my use case.
I have amended my ObjectsHandler so that it can create the objects index:
# objects_handler.py
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Document, Text, connections

es = Elasticsearch([{'host': 'localhost', 'port': 9200}],
                   http_auth='elastic:changeme')
connections.create_connection(es)

class Object(Document):
    physicalDescription = Text()
    title = Text()
    uniqueID = Text()

    class Index:
        name = 'objects'
        using = es

class ObjectsHandler():
    def init_mapping(self, index):
        Object.init(using=es, index=index)
This successfully creates an index when I call api.init_mapping(index) from the importer script.
The documentation has this as an example for persisting the individual documents, where Article is the equivalent to my Object class:
# create and save an article
article = Article(meta={'id': 42}, title='Hello world!', tags=['test'])
article.body = ''' looong text '''
article.published_from = datetime.now()
article.save()
Is it possible for me to use this methodology but to persist my pre-constructed json object doc, rather than specifying individual attributes? I also need to be able to specify that the document id is the doc uniqueID.
I've extended my ObjectsHandler to include a save_doc method:
def save_doc(self, document, doc_id, index):
    new_obj = Object(meta={'id': doc_id},
                     title="hello", uniqueID=doc_id,
                     physicalDescription="blah")
    new_obj.save()
which does successfully save the object with uniqueID as id but I am unable to utilise the json object passed in to the method as document.
I've had some success at this by using elasticsearch.py bulk helpers rather than elasticsearch-dsl.
The following resources were super helpful:
Blog - Bulk insert from json objects
SO Answer, showing different ways to add keywords in a bulk action
Elastic documentation on bulk imports
In my question I was referring to a:
doc = {"title": "A title", "Description": "Description", "uniqueID": "1234"}
I actually have an array or list of one or more docs, e.g.:
documents = [{"title": "A title", "Description": "Description", "uniqueID": "1234"}, {"title": "Another title", "Description": "Another description", "uniqueID": "1235"}]
I build up a body for the bulk import and append the id:
bulk_body = []
for document in documents:
    bulk_body.append({'index': {'_id': document["uniqueID"]}})
    bulk_body.append(document)
then run my new call to the bulk method:
api_handler.save_docs(bulk_body, 'objects')
with my objects_handler.py file looking like:
# objects_handler.py
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch_dsl import Document, Text, connections

es = Elasticsearch([{'host': 'localhost', 'port': 9200}],
                   http_auth='elastic:changeme')
connections.create_connection(es)

class Object(Document):
    physicalDescription = Text()
    title = Text()
    uniqueID = Text()

    class Index:
        name = 'objects'
        using = es

class ObjectsHandler():
    def init_mapping(self, index):
        Object.init(using=es, index=index)

    def save_docs(self, docs, index):
        print("Attempting to index the list of docs using helpers.bulk()")
        resp = es.bulk(index='objects', body=docs)
        print("helpers.bulk() RESPONSE:", resp)
        print("helpers.bulk() RESPONSE:", json.dumps(resp, indent=4))
This works for a single doc in JSON format or for multiple docs.
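For reference, if one still wanted the elasticsearch-dsl Document.save route from the original question, a minimal sketch (assuming the incoming dict's keys match the fields declared on Object) could unpack the dict instead of hard-coding attributes:
def save_doc(self, document, doc_id, index):
    # Assumption: document's keys are exactly title, uniqueID,
    # physicalDescription, matching the Object mapping above.
    new_obj = Object(meta={'id': doc_id}, **document)
    new_obj.save(using=es, index=index)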

Populate or aggregate 2 collections with sorting and pagination

I just started using Mongoose recently and I am a bit confused about how to sort and paginate.
Let's say I am making a project like Twitter and I have 3 schemas. The first is user, the second is post, and the third is post_detail. The user schema contains the user's data, a post is like an FB status or Twitter tweet that others can reply to, and post_detail holds the replies to a post.
user
var userSchema = mongoose.Schema({
    username: {
        type: String
    },
    full_name: {
        type: String
    },
    age: {
        type: Number
    }
});
post
var postSchema = mongoose.Schema({
    message: {
        type: String
    },
    created_by: {
        type: String
    },
    total_reply: {
        type: Number
    }
});
post_detail
var postDetailSchema = mongoose.Schema({
    post_id: {
        type: String
    },
    message: {
        type: String
    },
    created_by: {
        type: String
    }
});
The relations are: user._id = post.created_by, user._id = post_detail.created_by, and post_detail.post_id = post._id.
Say user A makes 1 post and 1000 other users comment on that post. How can we sort the comments by the users' names? A user can change their data (full_name and age in this case), so I can't put that data on the post_detail because it can change dynamically. Or should I just put it on the post_detail and, if the user changes their data, change the post_detail too? But if I do that I need to change many rows, because if the same user comments on 100 posts then that data needs to be changed there too.
The problem is how to sort it; I think if I can sort it I can paginate it too. Or should I just use an RDBMS instead of NoSQL in this case?
Thanks anyway, I really appreciate the help and guidance :)
Welcome to MongoDB.
If you want to do it in the way you describe, just don't go for Mongo.
You are designing the schema based on relations and not on documents.
Your design requires joins, and these do not work well in Mongo because there is no easy/fast way of doing them.
First, I would not create a separate entity for the post details but would embed the post details in the Post document as a list.
Regarding your question:
or I just put it on the post_detail and if user change data I just
change the post_detail too?
Yes, that is what you should do. If you want to be able to sort the documents by the userName, you should denormalize it and include it in the post_details.
If I had to design the schema, it would be something like this:
{
    "message": "blabl",
    "authorId": "userId12",
    "total_reply": 100,
    "replies": [
        {
            "message": "okk",
            "authorId": "66234",
            "authorName": "Alberto Rodriguez"
        },
        {
            "message": "test",
            "authorId": "1231",
            "authorName": "Fina Lopez"
        }
    ]
}
With this schema and using the aggregation framework, you can sort the comments by username.
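For example, here is a rough sketch of such a pipeline (the Post model name, postId, and the page size of 20 are illustrative assumptions):
// Sort one post's replies by author name; return the given 0-based page.
Post.aggregate([
    { $match: { _id: postId } },
    { $unwind: '$replies' },
    { $sort: { 'replies.authorName': 1 } },
    { $skip: page * 20 },
    { $limit: 20 }
]);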
If you don't like this approach, I would rather go for an RDBMS, as you mentioned.

MongoDB: use of subdocuments

TLDR; Should you use subdocuments or relational Id?
This is my PostSchema:
const Post = new mongoose.Schema({
    title: {
        type: String,
        required: true
    },
    body: {
        type: String,
        required: true
    },
    comments: [Comment.schema]
})
And this is my Comment Schema:
const Comment = new mongoose.Schema({
    body: {
        type: String,
        required: true
    }
})
In Postgres, I would have a post_id field in Comment, instead of having an array of comments inside Post. I am sure you can do the same in MongoDB but I don't know which one is more conventional. If people use subdocuments over references (and joining tables) in MongoDB, why is that? In other words, why should I ever use subdocuments? If it's advantageous, should I do the same in Postgres as well?
Answering based on what I understood from your question.
If you keep subdocuments, you don't have to query two collections to get the comments belonging to one post.
Let's say we have the following DB structure for posts:
[{
    _id: 1,
    title: 'some title',
    comments: [
        {
            ... // some fields that belong to comments
        },
        {
            ... // some fields that belong to comments
        },
        ...
    ]
},
{
    _id: 2,
    title: 'some title',
    comments: [
        {
            ... // some fields that belong to comments
        },
        {
            ... // some fields that belong to comments
        },
        ...
    ]
}]
Now you can query based on the _id of the post (1) and get the comments array that belongs to that specific post.
If you only keep the comment ids inside the post, you have to query both collections, which I don't think is a good idea.
EDIT :-
If you keep the post id inside each comment record, it helps you track which comment belongs to which post, i.e. you can query the comments collection by post id when you only need fields from the comment records.
The more common use case, I think, is to ask which comments a given post contains, so keeping the comments inside the post gives you the comment fields as well as the fields from the post record.
So it totally depends on your requirements and how you design your data structure.
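For comparison, here is a rough sketch of the reference-based alternative described in the question (the post_id field and the 'Post' ref name are illustrative assumptions):
const Comment = new mongoose.Schema({
    body: {
        type: String,
        required: true
    },
    // Reference back to the owning post, Postgres-style.
    post_id: {
        type: mongoose.Schema.Types.ObjectId,
        ref: 'Post'
    }
})
Fetching a post's comments then costs a second query, e.g. CommentModel.find({ post_id: somePostId }), which is exactly the extra round trip that embedding avoids.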

Facebook: Getting name of the members

I am trying to get the names of the members of my close friends list:
FB.api("/me/friendlists/close_friends?fields=members.fields(name)")
but it is not working.
FB.api("/me/friendlists/close_friends?fields=members")
gives me the complete members object with the following structure:
Object {data: Array[1], paging: Object}
  data: Array[1]
    0: Object
      id: "10150338266588525"
      members: Object
        data: Array[1]
          0: Object
            id: "812290716"
            name: "My Friend"
To read a FriendList, issue an HTTP GET request to /FRIENDLIST_ID with
the read_friendlists permission.
FB.api("/FRIENDLIST_ID?fields=name,members.fields(name)")
You can also call it like this; it returns the name and id (by default), so just ignore the user id if you don't want it.
FB.api("friendlists/close_friends?fields=members.fields(name)")
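For completeness, a rough sketch of pulling just the names out of the callback response (the response shape is assumed from the structure shown earlier):
FB.api("/FRIENDLIST_ID?fields=name,members.fields(name)", function (response) {
    // Assumption: members.data holds the member objects as shown above.
    response.members.data.forEach(function (member) {
        console.log(member.name);
    });
});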
Try the FB Graph API Explorer. Hope this helps!

REST API get resource id by field

What is a correct REST way of getting a resource ID by a field, for example a name? Take a look at the following operations:
GET /users/mike-thomas
GET /users/rick-astley
I don't want to use these operations in my API; instead I want to write an API operation that will get me the ID when submitting a field (the name, in the case of users), for example:
GET /users/id-by-field
Submitted data:
{
    "fullName": "Mike Thomas"
}
Return data:
{
    "data": {
        "id": "123456789012345678901234"
    }
}
What you want is known as an algorithmic URL where the parameters for the algorithm are passed as URL parameters:
GET /users?name="Mike Thomas"
Advantages are that you are using the "root" resource (users) and the search parameters are easily extended without having to change anything in the routing. For example:
GET /users?text="Mike"&year=1962&gender=M
where text would be searched for in more than just the name.
The resultant data would be a list of users and could return more than the identification of those users. Unless fullName uniquely identifies users, that is what you need to allow for anyway. And of course the list could contain a single user if the parameters uniquely identified that user.
{
    "users": [
        {
            "id": "123456789012345678901234",
            "fullName": "Mike Thomas",
            "dateJoined": 19620228
        },
        {
            "id": "234567890123456789012345",
            "fullName": "Rick Astley",
            "dateJoined": 19620227
        }
    ]
}
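For illustration, a minimal sketch of such an endpoint in Express (the User model and the app setup are assumptions, not part of the answer above):
// GET /users?name=Mike%20Thomas  ->  { "users": [ ... ] }
app.get('/users', async (req, res) => {
    const filter = {};
    if (req.query.name) {
        filter.fullName = req.query.name;
    }
    const users = await User.find(filter);
    res.json({ users });
});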