I'm coming to you for help because I'm working on a project that requires implementing a REST API that queries approximately 80 million indicators (documents) in a MongoDB collection. Each indicator follows this "schema":
{
indicator_type: string,
indicator_name: string,
entityId: string,
date: date,
stringDate: string,
value: double
}
There is currently an API built in Java, but it consumes a lot of CPU and memory, and sometimes there are timeouts or the responses take a long time; that's why we need to remake it. So my questions are these:
How bad is saving the indicators like this, and are there patterns for storing this kind of data?
Which programming language would you recommend for developing these endpoints?
We think much of our problem is the database, so we are considering moving to Google BigQuery. Can BigQuery help us get faster responses?
If BigQuery is not a good fit, which other tools would you recommend for this use case?
We are trying to produce responses like this one:
{
"totalConversions": {
"visitorPedestrianAverageConversion":5.847142857142858,
"ticketVisitorAverageConversion":0
},
"series":[
{
"data":[126,124,100,111,74,99,141],
"id":"indicator_type",
"type":"spline"
},
{
"data": [1925,2377,1873,1769,1067,2460,2139],
"id":"indicator_type",
"type":"spline"
},
{
"data":[0,0,0,0,0,0,0],
"id":"indicator_type",
"type":"spline"
},
{
"yAxis":1,
"data":[0,0,0,0,0,0,0],
"id":"indicator_type",
"type":"column"
},
{
"data":[0,0,0,0,0,0,0],
"id":"indicator_type",
"type":"spline"
},
{
"yAxis":2,
"data":[0,0,0,0,0,0,0],
"id":"indicator_type",
"type":"scatter"
},
{
"yAxis":2,
"data":[6.55,5.22,5.34,6.27,6.94,4.02,6.59],
"id":"indicator_type",
"type":"scatter"
},
{
"yAxis":2,
"data":[100,100,100,100,100,100,100],
"id":"indicator_type",
"type":"spline"
}
],
"categories":["Lun 02/03/2020","Mar 03/03/2020","Mié 04/03/2020","Jue 05/03/2020","Vie 06/03/2020","Sáb 07/03/2020","Dom 08/03/2020"]
}
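For reference, producing a response like this from the schema above comes down to a per-day, per-indicator aggregation. A hypothetical mongo-shell sketch of that query (collection name, entity ID, and date range are placeholders):

db.indicators.aggregate([
  // Hypothetical filter: one entity, one ISO week (placeholder values)
  { $match: {
      entityId: "some-entity-id",
      date: { $gte: ISODate("2020-03-02"), $lt: ISODate("2020-03-09") }
  } },
  // One bucket per indicator per day, summing the values
  { $group: {
      _id: { type: "$indicator_type", day: "$stringDate" },
      total: { $sum: "$value" }
  } },
  { $sort: { "_id.day": 1 } }
])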
My aim is to reduce the size of the output JSON file, which by default contains the base64-encoded image sections of the documents.
I am using the Document AI Contract Processor in the US region, with the Node.js SDK.
It is my understanding that setting the fieldMask attribute in the batchProcessDocuments request filters the properties that end up in the resulting JSON.
I want to keep only the entities property.
Here are my call parameters:
const documentai = require('@google-cloud/documentai').v1;
const client = new documentai.DocumentProcessorServiceClient(options);
let params = {
"name": "projects/XXX/locations/us/processors/3e85a4841d13ce5",
"region": "us",
"inputDocuments": {
"gcsDocuments": {
"documents": [{
"mimeType": "application/pdf",
"gcsUri": "gs://bubble-bucket-XXX/files/CymbalContract.pdf"
}]
}
},
"documentOutputConfig": {
"gcsOutputConfig": {
"gcsUri": "gs://bubble-bucket-XXXX/ocr/"
},
"fieldMask": {
"paths": [
"entities"
]
}
}
};
client.batchProcessDocuments(params, function(error, operation) {
if (error) {
return reject(error);
}
return resolve({
"operationName": operation.name
});
});
However, the resulting JSON still contains the full set of data.
Am I missing something here?
The auto-generated documentation for the Node.js client library is a little hard to follow, but it looks like the fieldMask should be a member of gcsOutputConfig instead of documentOutputConfig. (I'm surprised the API didn't throw an error.)
https://cloud.google.com/nodejs/docs/reference/documentai/latest/documentai/protos.google.cloud.documentai.v1.documentoutputconfig.gcsoutputconfig
The REST docs are a little clearer:
https://cloud.google.com/document-ai/docs/reference/rest/v1/DocumentOutputConfig#gcsoutputconfig
Note: For a REST API call and for other client libraries, the fieldMask is structured as a string (e.g. text,entities,pages.pageNumber)
I haven't tried this with the Node Client libraries before, but I'd recommend trying this as well if moving the parameter doesn't work on its own.
https://cloud.google.com/document-ai/docs/send-request#async-processor
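Here's a minimal sketch of the corrected request fragment with the mask moved (the bucket path is copied from the question; the protobuf-style { paths: [...] } object is my assumption for the Node.js client — per the note above, the string form may be needed instead):

"documentOutputConfig": {
  "gcsOutputConfig": {
    "gcsUri": "gs://bubble-bucket-XXXX/ocr/",
    // fieldMask moved inside gcsOutputConfig
    "fieldMask": { "paths": ["entities"] }
  }
}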
What's the "best" JSON structure when you need to "filter" through data in Firebase (in Swift)?
I'm having users sort their questions into:
Business
Entertainment
Other
Is it better to have a separate child for each question genre? If so, how do I get all of the data when I want it, and then filter it by "Business" only when I want to?
In NoSQL databases you usually end up modeling your data structure for the use-cases you want to allow in your app.
It's a bit of a learning path, so I'll explain it below in four steps:
Tree by category: Storing the data in a tree by its category, as you seem to be most interested in already.
Flat list of questions, and querying: Storing the data in a flat list, and then using queries to filter.
Flat list and indexes: Combining the above two approaches, to make the result more scalable.
Duplicating data: By duplicating data on top of that, you can reduce code complexity and improve performance further.
Tree by category
If you only want to get the questions by their category, you're best off simply storing each question under its category. In a simple model that'd look like this:
questionsByCategory: {
Business: {
question1: { ... },
question4: { ... }
},
Entertainment: {
question2: { ... },
question5: { ... }
},
Other: {
question3: { ... },
question6: { ... }
}
}
With the above structure, loading the list of questions for a category is a simple, direct-access read for that category: firebase.database().ref("questionsByCategory").child("Business").once("value")....
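Spelled out in full, that read looks like this (a minimal sketch completing the truncated call above):

firebase.database().ref("questionsByCategory").child("Business").once("value")
  .then(function(categorySnapshot) {
    // Each child is one question stored under the Business category
    categorySnapshot.forEach(function(questionSnapshot) {
      console.log(questionSnapshot.key + ": " + JSON.stringify(questionSnapshot.val()));
    });
  });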
But if you need a list of all questions, you have to read all categories and denest them client-side. If you need all questions anyway, that is not a real problem, since you'd load them all in any case; but if you want to filter on some condition other than category, this can be wasteful.
Flat list of questions, and querying
An alternative is to create a flat list of all questions, and use queries to filter the data. In that case your JSON would look like this:
questions: {
question1: { category: "Business", difficulty: 1, ... },
question2: { category: "Entertainment", difficulty: 1, ... },
question3: { category: "Other", difficulty: 2, ... },
question4: { category: "Business", difficulty: 2, ... }
question5: { category: "Entertainment", difficulty: 3, ... }
question6: { category: "Other", difficulty: 1, ... }
}
Now, getting a list of all questions is easy, as you can just read them and loop over the results:
firebase.database().ref("questions").once("value").then(function(result) {
result.forEach(function(snapshot) {
console.log(snapshot.key+": "+snapshot.val().category);
})
})
If we want to get all questions for a specific category, we use a query instead of just the ref("questions"). So:
Get all Business questions:
firebase.database().ref("questions").orderByChild("category").equalTo("Business").once("value")...
Get all questions with difficulty 3:
firebase.database().ref("questions").orderByChild("difficulty").equalTo(3).once("value")...
This approach works quite well, unless you have huge numbers of questions.
Flat list and indexes
If you have millions of questions, Firebase Realtime Database queries may not perform well enough for you anymore. In that case you may need to combine the two approaches above, using a flat list to store the questions and so-called (self-made) secondary indexes to perform the filtered lookups.
If you think you'll ever reach this number of questions, I'd consider using Cloud Firestore, as that does not have the inherent scalability limits that the Realtime Database has. In fact, Cloud Firestore has the unique guarantee that retrieving a certain amount of data takes a fixed amount of time, no matter how much data there is in the database/collection.
In this scenario, your JSON would look like:
questions: {
question1: { category: "Business", difficulty: 1, ... },
question2: { category: "Entertainment", difficulty: 1, ... },
question3: { category: "Other", difficulty: 2, ... },
question4: { category: "Business", difficulty: 2, ... }
question5: { category: "Entertainment", difficulty: 3, ... }
question6: { category: "Other", difficulty: 1, ... }
},
questionsByCategory: {
Business: {
question1: true,
question4: true
},
Entertainment: {
question2: true,
question5: true
},
Other: {
question3: true,
question6: true
}
},
questionsByDifficulty: {
"1": {
question1: true,
question2: true,
question6: true
},
"2": {
question3: true,
question4: true
},
"3": {
question5: true
}
}
You see that we have a single flat list of the questions, and then separate lists for the different properties we want to filter on, holding the IDs of the matching questions for each value. Those secondary lists are often called (secondary) indexes, since they really serve as indexes on your data.
To load the hard questions in the above, we take a two-step approach:
Load the question IDs with a direct lookup.
Load each question by its ID.
In code:
firebase.database().ref("questionsByDifficulty/3").once("value").then(function(result) {
result.forEach(function(snapshot) {
firebase.database().ref("questions").child(snapshot.key).once("value").then(function(questionSnapshot) {
console.log(questionSnapshot.key+": "+questionSnapshot.val().category);
});
})
})
If you need to wait for all questions before logging (or otherwise processing) them, you'd use Promise.all:
firebase.database().ref("questionsByDifficulty/3").once("value").then(function(result) {
var promises = [];
result.forEach(function(snapshot) {
promises.push(firebase.database().ref("questions").child(snapshot.key).once("value"));
})
Promise.all(promises).then(function(questionSnapshots) {
questionSnapshots.forEach(function(questionSnapshot) {
console.log(questionSnapshot.key+": "+questionSnapshot.val().category);
})
})
})
Many developers assume that this approach is slow, since it needs a separate call for each question. But it's actually quite fast, since Firebase pipelines the requests over its existing connection. For more on this, see Speed up fetching posts for my social network app by using query instead of observing a single event repeatedly
Duplicating data
The code for the nested load/client-side join is a bit tricky to read at times. If you'd prefer only performing a single load, you could consider duplicating the data for each question into each secondary index too.
In this scenario, the secondary index would look like this:
questionsByCategory: {
Business: {
question1: { category: "Business", difficulty: 1, ... },
question4: { category: "Business", difficulty: 2, ... }
},
...
}
If you come from a background in relational data modeling, this may look quite unnatural, since we're now duplicating data between the main list and the secondary indexes.
To an experienced NoSQL data modeler however, this looks completely normal. We're trading off storing some extra data against the extra time/code it takes to load the data.
This trade-off is common in all areas of computer science, and in NoSQL data modeling you'll fairly often see folks choosing to sacrifice space (and thus store duplicate data) to get an easier and more scalable data model.
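In practice you keep such duplicates in sync by fanning each write out to every path in a single multi-location update. A sketch (the question ID and values are made up):

// Write the new question to the main list and to each secondary index at once
var question = { category: "Business", difficulty: 2 };
var updates = {};
updates["questions/question7"] = question;
updates["questionsByCategory/Business/question7"] = question; // duplicated data
updates["questionsByDifficulty/2/question7"] = true;          // index flag
firebase.database().ref().update(updates);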
I can't seem to get the results I want, so I'm asking for your kind help. Imagine a MongoDB database built for multi-language content, something like this:
"info": {
"en": {
"greetings": {
"hello":"hello",
"goodbye":"goodbye"
},
"directions": {
"left": "left",
"right": "right"
}
},
"pt": {
"greetings": {
"hello":"olá",
"goodbye":"adeus"
},
"directions": {
"left": "esquerda",
"right": "direita"
}
}
}
Now, if I want to query to get the directions object in both English and Portuguese, but not the greetings, how should I do it?
If it helps, my purpose is to subscribe to exactly that content in a Meteor app's template, so there's no need for all the other objects within a given language object, just the one I need for any given template.
Thanks in advance!
How about you iterate?
db.collection.find().forEach(function (language) {
  // some code here
})
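For example, to keep only the directions object for each language (field names match the sample document above):

db.collection.find().forEach(function (doc) {
  var directions = {};
  Object.keys(doc.info).forEach(function (lang) {
    directions[lang] = doc.info[lang].directions;
  });
  printjson(directions); // { en: { left: ..., right: ... }, pt: { ... } }
})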
Redux recommends using a normalized app state tree, but I am not sure if that's the best practice in this case. Assume the following:
Each Circle has_many Posts.
Each Post has_many Comments.
In the database on the backend, each model looks like this:
Circle:
{
_id: '1',
title: 'BoyBand'
}
Post:
{
_id: '1',
circle_id: '1',
body: "Some Post"
}
Comment:
{
_id: '1',
post_id: '1',
body: "Some Comment"
}
The app state on the frontend (the final result of all reducers) looks like this:
{
circles: {
byId: {
1: {
title: 'BoyBand'
}
},
allIds: [1]
},
posts: {
byId: {
1: {
circle_id: '1',
body: 'Some Post'
}
},
allIds: [1]
},
comments: {
byId: {
1: {
post_id: '1',
body: 'Some Comment'
}
},
allIds: [1]
}
}
Now, when I go to CircleView, I fetch the Circle from the backend, which returns all posts and comments associated with it.
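// Note: this snippet assumes `request` is superagent and `normalize`/`schema`
// come from normalizr (suggested by the call style); `constants` and `API_URL`
// are app-level modules.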
export const fetchCircle = (title) => (dispatch, getState) => {
dispatch({
type: constants.REQUEST_CIRCLE,
data: { title: title }
})
request
.get(`${API_URL}/circles/${title}`)
.end((err, res) => {
if (err) {
return
}
// When you fetch circle from the API, the API returns:
// {
// circle: circleObj,
// posts: postsArr,
// comments: commentsArr
// }
// so it's easier for the reducers to consume the data
dispatch({
type: constants.RECEIVE_CIRCLE,
data: (normalize(res.body.circle, schema.circle))
})
dispatch({
type: 'RECEIVE_POSTS',
data: (normalize(res.body.posts, schema.arrayOfPosts))
})
dispatch({
type: 'RECEIVE_COMMENTS',
data: (normalize(res.body.comments, schema.arrayOfComments))
})
})
}
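For reference, the normalizr schema referenced above could be defined roughly like this (a sketch; the entity keys and idAttribute are assumptions based on the data shown):

import { schema } from 'normalizr'

// Mongo-style documents use _id as their identifier
const circle = new schema.Entity('circles', {}, { idAttribute: '_id' })
const post = new schema.Entity('posts', {}, { idAttribute: '_id' })
const comment = new schema.Entity('comments', {}, { idAttribute: '_id' })
const arrayOfPosts = [post]
const arrayOfComments = [comment]

export default { circle, post, comment, arrayOfPosts, arrayOfComments }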
Up to this point, I think I did everything in a fairly standard way. However, when I wanted to render each Post component, I realized that populating the posts with their comments became inefficient (O(N^2)) compared to when I kept my state tree in the following format.
{
circles: {
byId: {
1: {
title: 'BoyBand'
}
},
allIds: [1]
},
posts: {
byId: {
1: {
circle_id: '1',
body: 'Some Post',
comments: [arrOfComments]
}
},
allIds: [1]
}
}
This goes against my understanding where in a redux state tree, it's better to keep everything normalized.
Q. Should I in fact keep things denormalized in a case like this? How do I determine what to do?
I'd go for: yes, normalize it, but do it on the backend!
Why?
Deleting is easier
because otherwise, you'd have to track down the posts and comments every time you'd want to delete a circle, or post.
Working with the data is easier
because otherwise, you'd have to do the same mutations on your data over and over again just so that you can select the dataset which is related to a particular circle or post.
You don't have any many-to-many relationship
you don't have multiple posts which link to the same comment so it just makes sense to have the data normalized.
You shouldn't be limited by an API
If this is a third-party API, make your backend fetch the API and normalize the data there. You shouldn't be restricted by the API, and I don't know what kind of data you access, but you can definitely save a DNS lookup for the user and serve cached data if the API is unavailable. If you rely on the API being up, you introduce a single point of failure.
As for your performance issues: they should be insignificant if you normalize on the backend, but you should measure them and take the critical code for a code review.
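On the O(N^2) concern specifically: if you group the comments by post once per state change, the join drops back to O(N). A sketch (assuming the reselect library, commonly used with Redux):

import { createSelector } from 'reselect'

// Builds { [post_id]: [comments...] } in a single pass over all comments
const selectCommentsByPost = createSelector(
  (state) => state.comments.allIds,
  (state) => state.comments.byId,
  (allIds, byId) => {
    const byPost = {}
    allIds.forEach((id) => {
      const comment = byId[id]
      if (!byPost[comment.post_id]) byPost[comment.post_id] = []
      byPost[comment.post_id].push(comment)
    })
    return byPost
  }
)

// Usage in a component: selectCommentsByPost(state)[post._id] || []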
In my opinion, the list of comments is specific to a post. A user cannot post one comment into multiple posts, and there's nothing wrong with comments being tightly coupled to the post. It's easy to update or remove a specific comment (both postId and commentId are present). Removing a post is trivial. Same with a circle. It's only insignificantly harder to remove all comments of a specific user. And I think there are no strict rules, no single RIGHT way; more often, it depends. KISS ;)
While thinking about how to organize comments on the client side, I was reading this article about possible DB structures for a similar situation: https://docs.mongodb.com/ecosystem/use-cases/storing-comments/
As a follow-up to my previous question about REST URIs for retrieving statistical information for a web forum resource, I want to know whether it is possible to use internal anchors as filter hints. See the examples below:
a) Get all statistics:
GET /group/5t7yu8i9io0op/stat
{
group_id: "5t7yu8i9io0op",
top_ranking_users: [
{ user: "george", posts: 789, rank: 1 },
{ user: "joel", posts: 560, rank: 2 }, ...
],
popular_topics: [ ... ],
new_topics: [ ... ]
}
b) GET only popular topics
GET /group/5t7yu8i9io0op/stat#popular_topics
{
group_id: "5t7yu8i9io0op",
popular_topics: [ ... ]
}
c) GET only top ranking users
GET /group/5t7yu8i9io0op/stat#top_ranking_users
{
group_id: "5t7yu8i9io0op",
top_ranking_users: [
{ user: "george", posts: 789, rank: 1 },
{ user: "joel", posts: 560, rank: 2 }, ...
]
}
Or should I be using query parameters ?
Not sure what you are trying to do exactly, but make sure you understand that fragment identifiers are not seen by the server; they are chopped off by the client connector.
See: http://www.nordsc.com/blog/?p=17
I've never seen anchors being used that way - it's interesting. That being said, I'd suggest using query parameters, for a couple of reasons:
They're standard - and consumers of your API will be comfortable with them. There's nothing more annoying than dealing with a quirky API.
Many frameworks will auto-parse the query parameters and set them in a dictionary on the request object (or whatever analogue exists in your framework / http server library).
I think it would make more sense to have:
/group/5t7yu8i9io0op/stat/top_users
/group/5t7yu8i9io0op/stat/popular_topics
/group/5t7yu8i9io0op/stat/new_topics
/group/5t7yu8i9io0op/stat/user/george
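For comparison, the query-parameter version of the same filtering might look like this (the parameter name is made up):

GET /group/5t7yu8i9io0op/stat?include=top_ranking_users,popular_topics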
No, you cannot do that, because as Jan points out, the server will never see the fragment identifier. Literally, that part of the URL will not reach the server.