Forum-like data structure: NoSQL appropriate? - mongodb

I'm trying to save data which has a "forum like" structure:
This is the simplified data model:
+---------------+
|     Forum     |
|               |
| Name          |
| Category      |
| URL           |
+---------------+
        |1
        |n
+---------------+
|    Thread     |
|               |
| ID            |
| Name          |
| Author        |
| Creation Date |
| URL           |
+---------------+
        |1
        |n
+---------------+
|     Post      |
|               |
| Creation Date |
| Links         |
| Images        |
+---------------+
I have multiple forums/boards. Each forum can have several threads, and a thread can contain n posts (I'm only interested in the links, images, and creation date of each post, for data analysis purposes).
I'm looking for the right technology for saving and reading data in a structure like this.
While I was using SQL databases heavily in the past, I also had some NoSQL projects (primarily document based with MongoDB).
I'm sure MongoDB is excellent for STORING data in such a structure (Forum is a document, while the Threads are subdocuments. Posts are subdocuments in Threads). But what about reading them? I have the following use cases:
List all posts from a forum with a specific Category
Find a specific link in a Post in all datasets/documents
Which technology is best for those use cases?

Please find my draft solution below; I have considered MongoDB for this design.
Post collection:
"image" should be stored separately in GridFS as MongoDB collection have a maximum size of 16MB. You can store the ObjectId of the image in the Post collection.
{
    "_id" : ObjectId("57b6f7d78f19ac1e1fcec7b5"),
    "createdate" : ISODate("2013-03-16T02:50:27.877Z"),
    "links" : "google.com",
    "image" : ObjectId("5143ddf3bcf1bf4ab37d9c6e"),
    "thread" : [
        {
            "id" : ObjectId("5143ddf3bcf1bf4ab37d9c6e"),
            "name" : "Sam",
            "author" : "Sam",
            "createdate" : ISODate("2013-03-16T02:50:27.877Z"),
            "url" : "https://www.wikipedia.org/"
        }
    ],
    "forum" : [
        {
            "name" : "Andy",
            "category" : "technology",
            "url" : "https://www.infoq.com/"
        }
    ]
}
In order to access the data by category, you can create an index on the "forum.category" field.
db.post.createIndex( { "forum.category": 1 } )
In order to access the data by links, you can create an index on the "links" field.
db.post.createIndex( { "links": 1 } )
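For example, the two read use cases against the draft document above could look like this (a minimal sketch; the category and link values are just placeholders taken from the sample document):
// List all posts that belong to a forum with a specific category
db.post.find( { "forum.category": "technology" } )
// Find a specific link across all post documents
db.post.find( { "links": "google.com" } )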
Please note that the indexes are not mandatory; you can query the data without them, but they give you better read performance.
I have seen applications use MongoDB for use cases similar to yours, so you can go ahead with MongoDB for the above-mentioned use cases (or access patterns).

Related

Parse record (PCF) from Kafka using Kafka Kusto Sink

I've set up my environment using Docker based on this guide.
With kafka-console-producer I will send this line:
Hazriq|27|Undegrad|UNITEN
I want this data to be ingested to Kusto like this:
+--------+-----+----------------+------------+
| Name   | Age | EducationLevel | University |
+--------+-----+----------------+------------+
| Hazriq | 27  | Undegrad       | UNITEN     |
+--------+-----+----------------+------------+
Can this be handled by Kusto using the mapping (which I'm still trying to understand), or should this be handled by Kafka?
I tried @daniel's suggestion:
.create table ParsedTable (name: string, age: int, educationLevel: string, univ:string)
.create table ParsedTable ingestion csv mapping 'ParsedTableMapping' '[{ "Name" : "name", "Ordinal" : 0},{ "Name" : "age", "Ordinal" : 1 },{ "Name" : "educationLevel", "Ordinal" : 2},{ "Name" : "univ", "Ordinal" : 3}]'
kusto.tables.topics_mapping=[{'topic': 'kafkatopiclugiaparser','db': 'kusto-test', 'table': 'ParsedTable','format': 'psv', 'mapping':'ParsedTableMapping'}]
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
but getting this instead:
+----------------------------+-----+----------------+------+
| Name                       | Age | EducationLevel | Univ |
+----------------------------+-----+----------------+------+
| Hazriq|27|Undergrad|UNITEN |     |                |      |
+----------------------------+-----+----------------+------+
Currently, the connector passes the data as it comes (no manipulation on it on the client side), and any parsing is left to Kusto.
The psv format is supported by Kusto, so this should be possible by setting the format to psv and providing a mapping reference.
When adding the plugin as described, you should be able to set it up like:
kusto.tables.topics_mapping=[{'topic': 'testing1','db': 'testDB', 'table': 'KafkaTest','format': 'psv', 'mapping':'KafkaMapping'}]
The mapping can be defined in Kusto as described in the Kusto docs.
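For example, a sketch that assumes the KafkaTest table uses the same four columns as the ParsedTable shown above:
.create table KafkaTest ingestion csv mapping 'KafkaMapping' '[{ "Name" : "name", "Ordinal" : 0},{ "Name" : "age", "Ordinal" : 1 },{ "Name" : "educationLevel", "Ordinal" : 2},{ "Name" : "univ", "Ordinal" : 3}]'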
Ingestion of data as you've shown using the psv format is supported (see below); it's probably just a matter of debugging why your client-side invocation of the underlying commands isn't yielding the expected result. If you could share the full flow and code, including parameters, that may be helpful.
.create table ParsedTable (name: string, age: int, educationLevel: string, univ:string)
.ingest inline into table ParsedTable with(format=psv) <| Hazriq|27|Undegrad|UNITEN
ParsedTable:
| name   | age | educationLevel | univ   |
|--------|-----|----------------|--------|
| Hazriq | 27  | Undegrad       | UNITEN |

Thinking NoSql on reference data

I'm trying out NoSQL, and while exploring it I can't wrap my head around how to deal with reference data. (I'm used to traditional, tabular databases.) Say I have a School entity which has Students and Requirements. A Student can be enrolled in a School and may comply with the Requirements later. So the School would look up a Student and check which Requirements he has complied with.
In a traditional database, I would do something like this:
+---------+   +---------------+   +--------------------+   +---------+
| School  |   | Requirement   |   | StudentRequirement |   | Student |
+---------+   +---------------+   +--------------------+   +---------+
| Id (PK) |   | Id (PK)       |   | Id (PK)            |   | Id (PK) |
| Name    |   | Name          |   | StudentId (FK)     |   | Name    |
+---------+   | SchoolId (FK) |   | RequirementId (FK) |   +---------+
              +---------------+   | HasComply          |
                                  +--------------------+
I would create 4 entities, with a many-to-many relationship between Requirement and Student. So whenever I edit or remove a Requirement, I can just look at the intermediary table.
A flow something like:
// EnrollStudentToASchool
// AssignAllRequirementsToNewStudent
Then somewhere in my code, if a new requirement was created
// IfNewRequirement
// AddNewRequirementToStudent
Now, to NoSQL; in my case I'm using MongoDB, a document data store. I read somewhere that data should be embedded. Something like:
{
    Id: 1,
    School: 'Top1 Foo School',
    Requirements:
    [
        { Id: 1, Name: 'Req1' },
        { Id: 2, Name: 'Req2' }
    ],
    Students:
    [
        {
            Id: 1,
            Name: 'Student1',
            Requirements:
            [
                { Id: 1, Name: 'Req1', HasComply: false },
                { Id: 2, Name: 'Req2', HasComply: true },
            ]
        }
    ]
},
{
    Id: 2,
    School: 'Top1 Bar School',
    Requirements: [],
    Students: []
}
The root of my document will be the School, with the same flow as above:
// EnrollStudentToASchool
// AssignAllRequirementsToNewStudent
// IfNewRequirement
// AddNewRequirementToStudent
But what if, say, the School decides to edit the name of a Requirement, or to remove a Requirement?
How should that be done? Should I loop over all my Students and edit/remove the Requirements? Or maybe I'm doing it all wrong.
Please advise.
This is a nice use case.
Your example brings up most of the relevant pros and cons of converting from SQL to NoSQL.
First, please look at the proposed collection design:
We have two collections, school and student. Why? We have to keep the BSON document size limit (16 MB) in mind, and if a school has a large number of students the document could grow past that size.
So why do we duplicate data in every student record? Because when we want a student's details, we don't need to go to the school document (no extra round trip).
We have an array of requirements to fulfil in the school document (a kind of master copy), and every student has their own array with the results.
Adding or removing such data requires iterating over all students as well as the school document.
So, in simple words: no joins for the daily display operations (efficiency), but updates generate a bit more load than in SQL.
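For illustration, renaming a requirement in this two-collection design could look roughly like this (a sketch; the collection and field names are assumptions, and arrayFilters requires MongoDB 3.6+):
// Rename requirement 2 in the school's master list
db.school.updateOne(
  { _id: 1, "requirements.id": 2 },
  { $set: { "requirements.$.name": "Req2-renamed" } }
)
// Rename it in every student's own requirements array
db.student.updateMany(
  { "requirements.id": 2 },
  { $set: { "requirements.$[r].name": "Req2-renamed" } },
  { arrayFilters: [ { "r.id": 2 } ] }
)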
Any comments welcome!

Optimizing MongoDB: indexing multiple fields for multiple queries

I am new to database indexing. My application has the following "find" and "update" queries, which search by single and multiple fields:
                     | reference | timestamp | phone | username | key | Address
update               |     x     |           |       |          |     |
findOne              |           |     x     |   x   |          |     |
find/limit:16        |           |     x     |   x   |    x     |     |
find/limit:11        |           |     x     |       |          |  x  |    x
find/limit:1/sort:-1 |           |     x     |   x   |          |  x  |    x
find                 |           |     x     |       |          |     |
1) update({"reference":"f0d3dba-278de4a-79a6cb-1284a5a85cde"}, ……….
2) findOne({"timestamp":"1466595571", "phone":"9112345678900"})
3) find({"timestamp":"1466595571", "phone":"9112345678900", "username":"a0001a"}).limit(16)
4) find({"timestamp":"1466595571", "key":"443447644g5fff", "address":"abc road, mumbai, india"}).limit(11)
5) find({"timestamp":"1466595571", "phone":"9112345678900", "key":"443447644g5fff", "address":"abc road, mumbai, india"}).sort({"_id":-1}).limit(1)
6) find({"timestamp":"1466595571"})
I am creating these indexes:
db.coll.createIndex( { "reference": 1 } ) //for 1st, 6th query
db.coll.createIndex( { "timestamp": 1, "phone": 1, "username": 1 } ) //for 2nd, 3rd query
db.coll.createIndex( { "timestamp": 1, "key": 1, "address": 1, phone: 1 } ) //for 4th, 5th query
Is this the correct way?
Please help me
Thank you
I think what you have done looks fine. One way to check if your query is using an index, which index is being used, and whether the index is effective is to use the explain() function alongside your find().
For example:
db.coll.find({"timestamp":"1466595571"}).explain()
will return a JSON document that details which index (if any) was used. In addition, you can ask explain to return "executionStats",
e.g.
db.coll.find({"timestamp":"1466595571"}).explain("executionStats")
This will tell you how many index keys were examined to find the result set as well as the execution time and other useful metrics.
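If you want to compare plans, you can also force a particular index with hint() while testing (a quick sketch using the second compound index above):
db.coll.find({"timestamp":"1466595571", "phone":"9112345678900"}).hint({ "timestamp": 1, "phone": 1, "username": 1 }).explain("executionStats")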

Removing empty Columns in Pentaho Kettle before inserting on MongoDB

I am using Pentaho Kettle as a tool to process several CSV files before inserting them into MongoDB for the first time.
Since MongoDB is schemaless, I don't see the point in keeping the null column values of the CSV rows. I receive something like this from the CSV:
+------------+----------+---------+
| _id        | VALUE_1  | VALUE_2 |
+------------+----------+---------+
| 1          | 1        | 1       |
| 2          | 2        | null    |
| 3          | null     | 2       |
+------------+----------+---------+
And insert it into MongoDB in a way that I get this in there:
{ "_id" : 1, "VALUE_1" : 1, "VALUE_2" : 1 }
{ "_id" : 2, "VALUE_1" : 2 }
{ "_id" : 3, "VALUE_2" : 2}
How would I do such a thing in Kettle? I just can't seem to find the right option; there is a Filter Rows step, but it doesn't seem to be what I want.
I'm having the same problem. One workaround I found from Matt Casters and Diethard Steiner is to unpivot the data and then remove the null rows. Then you could pivot back and write out the JSON with a JavaScript step or a JSON Output step, perhaps. Similar to this:
http://diethardsteiner.blogspot.com/2010/11/pentaho-kettle-data-input-pivoted-data.html
This worked fine for small files, but I have large CSVs with 30-100 columns and hundreds of thousands of rows (millions in some cases), so pivoting is very slow. But maybe you can come up with another idea; I'd be glad to hear it! =)
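One other idea, if you want to avoid pivoting entirely: build the JSON per row yourself in a JavaScript step, skipping null fields, and feed the resulting string to a JSON/MongoDB output step. A rough, untested sketch for a "Modified Java Script Value" step; it assumes input fields are available as variables named after the columns and that your Kettle version's Rhino engine provides JSON.stringify:
// Build a document per row, keeping only non-null columns (field names are the example columns above)
var doc = {};
if (_id != null) doc["_id"] = _id;
if (VALUE_1 != null) doc["VALUE_1"] = VALUE_1;
if (VALUE_2 != null) doc["VALUE_2"] = VALUE_2;
// Expose the JSON string as a new output field, e.g. json_doc
var json_doc = JSON.stringify(doc);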

How to design the schema when the embedded documents are too big

Given the data structure below: as you can see, each record inside one file has the same values for ATT1 and ATT2.
// Store in fileD001.txt
ATT1 | ATT2 | ATT3 | ATT4 ... | ATT200
D001 | 10102011 | x13 | x14 ... | x1200
D001 | 10102011 | x23 | x24 ... | x2200
...
D001 | 10102011 | xN3 | xN4 ... | xN200
// Store in fileD002.txt
ATT1 | ATT2 | ATT3 | ATT4 ... | ATT200
D002 | 10112011 | x13 | x14 ... | x1200
D002 | 10112011 | x23 | x24 ... | x2200
...
D002 | 10112011 | xN3 | xN4 ... | xN200
// Store in fileD003.txt
ATT1 | ATT2 | ATT3 | ATT4 ... | ATT200
D003 | 10132011 | x13 | x14 ... | x1200
D003 | 10132011 | x23 | x24 ... | x2200
...
D003 | 10132011 | xN3 | xN4 ... | xN200
Method One: Assume I use the following structure to store the data.
doc = { "ATT1" : "D001",
        "ATT2" : "10102011",
        "ATT3" : "x13",
        "ATT4" : "x14",
        ...
        "ATT200" : "x1200"
      }
Here is the problem: the data contains a lot of duplicated information and wastes DB space. However, the benefit is that each record has its own _id.
Method Two: Assume I use the following structure to store the data.
doc = { "ATT1" : "D001",
        "ATT2" : "10102011",
        "sub_doc" : { "ATT3" : "x13",
                      "ATT4" : "x14",
                      ...
                      "ATT200" : "x1200"
                    }
      }
Here is the problem: the data size N, which is around 1~5000, is too large to be handled by MongoDB in one insert operation. Of course, we can use the $push update modifier to gradually append the data; however, each record no longer has its own _id this way.
I don't mean each record has to have its own ID; I am just looking for a better design for a task like this.
Thank you
Option 1 is decent since it gives you the easiest data to work with. Maybe worry less about the space since it is cheap?
Option 2 is good to conserve space, though watch out that your document does not get too large. Maximum document size may limit you. Also, if you shard in the future this could limit you.
Option 3 is to be a little relational about it: have two collections. The first one is just a lookup for ATT1 and ATT2 pairs. The other collection holds a reference to the first plus the remaining attributes.
parent = { att1: "val1", att2: "val2"}
child = {parent: parent.id, att3: "val3"...}
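In mongo shell terms, a minimal sketch of option 3 could look like this (the collection names and the use of insertOne are my assumptions):
// Parent: one document per (ATT1, ATT2) pair
var parentId = db.parents.insertOne({ att1: "D001", att2: "10102011" }).insertedId
// Child: one document per record, referencing its parent
db.children.insertOne({ parent: parentId, att3: "x13", att4: "x14" /* ... att200 */ })
// Read back all records for a given (ATT1, ATT2) pair
var p = db.parents.findOne({ att1: "D001", att2: "10102011" })
db.children.find({ parent: p._id })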