How Hadoop stores email data

I'm new to the Big Data field. I started exploring its tools, like Hadoop, and got some clarity about the framework and the Map/Reduce model, but I still have a lot of questions.
I want to analyse emails and do some email categorisation so I can organise them into different categories, but I was wondering how I should store those emails in HDFS.
Should I first convert my emails to a text file (composed of space-separated columns: Date, Author, Subject, Content, ...) or to a sequence file composed of binary key-value pairs, and then store that file in HDFS?
I'm not used to working with sequence files, but I have read many articles about how HDFS stores unstructured data in that type of file.
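For what it's worth, this is the kind of thing I picture for the sequence-file option, just a minimal sketch with Hadoop's SequenceFile.Writer; the key/value layout (message id as key, tab-separated fields as value) and the output path are only my assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class EmailToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("hdfs:///user/me/emails.seq"); // placeholder path

        // Key = message id, value = tab-separated fields: Date, Author, Subject, Content
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {

            Text key = new Text("msg-0001");
            Text value = new Text("2015-03-01\talice@example.com\tHello\tBody of the email...");
            writer.append(key, value);
        }
    }
}

If a plain text file turns out to be the better choice, I suppose I would just write one tab-separated line per email instead.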
Can someone please enlighten me?
Thanks in advance.

Related

Extract Data in Sprinklr

I have a list of 600 companies in an Excel file. I want to extract data for each of them into a .csv file.
I have started by searching for each company separately, downloading the Excel file, and saving it. It's very time-consuming.
Can someone help me with an easier method of solving this?

How best to store HTML alongside strings in Cloud Storage

I have a collection of data, and in each case there is a chunk of HTML and a few strings, for example:
html: <div>html...</div>, name string: html chunk 1, date string: 01-01-1999, location string: London, UK. I would like to store this information together as a single Cloud Storage object. Specifically, I am using Google Cloud Storage. There are two ways I can think of doing this. One is to store the strings as custom metadata and the HTML as the actual file contents. The other is to store all the information as a JSON file, with the HTML as a base64-encoded string.
I want to avoid a situation where after having stored a lot of data, I find there is some limitation to the approach I am using. What is the proper way to do this - is either of these approaches bad practice? Assuming there is no problem with either, I would go with the JSON approach because it is easier to pass around all the data together as a file.
There isn't one specific right way to do what you're talking about; there are potential pitfalls and performance considerations, but they depend on what you're doing with the data and why:
Do you ever need access to the metadata for queries? You won't be able to do that efficiently if you pack everything into one variable as a JSON object.
What are you parsing the data with later? Does it have built-in support for JSON? Does it support something else?
Is speed a consideration? Is cloud storage space a consideration?
Does a user have the ability to input the HTML, and could they potentially perform some sort of attack?
How do you use the data when you retrieve it? How stable is the format of the data?
You could use JSON, Protocol Buffers, packed binary blobs in a length | value format, base64 with a delimiter, or zip files turned into binary blobs. Do what suits your application and allows a clean, structured design that you can test and maintain.
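If the custom-metadata route fits your access pattern, a minimal sketch with the google-cloud-storage Java client could look like the following; the bucket name, object name, and metadata keys are placeholders, not anything from your setup:

import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.nio.charset.StandardCharsets;
import java.util.Map;

public class StoreHtmlWithMetadata {
    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();

        String html = "<div>html...</div>";
        // Placeholder bucket and object names
        BlobId blobId = BlobId.of("my-bucket", "chunks/chunk-1.html");

        // HTML goes in the object body; the strings become custom metadata
        BlobInfo blobInfo = BlobInfo.newBuilder(blobId)
                .setContentType("text/html")
                .setMetadata(Map.of(
                        "name", "html chunk 1",
                        "date", "01-01-1999",
                        "location", "London, UK"))
                .build();

        storage.create(blobInfo, html.getBytes(StandardCharsets.UTF_8));
    }
}

With this layout the HTML stays directly servable as text/html and the strings are visible to anything that can list object metadata; the JSON-plus-base64 variant would instead put everything in the object body and leave the metadata empty.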

Document Repository REST application in Java

I have a requirement to develop a document repository which will maintain all documents related to different listed companies. Each document will be related to a company. It has to be a REST API. Documents can be in PDF, HTML, Word, or Excel format. Along with storing documents, I need to store metadata as well, like CompanyID, document format, timestamp, document language, etc.
As the number of documents will grow in the years to come, it's important that the application is scalable.
I also need to translate non-English documents and store the translated English version in some parent-child relation that is easy to retrieve.
Any insights on the approach, libraries/JARs to use, best practices, and references are welcome.
The base64-encoded content of the file could be included as part of your payload, along with the file metadata.
Posting a File and Associated Data to a RESTful WebService preferably as JSON
Once the file reaches your end, you could either save it locally to your hard disk or save the same base64-encoded content as-is in your data store (use a blob/clob).
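As a rough sketch of that payload, built here with java.util.Base64 and Jackson; the field names, file path, and metadata values are invented for illustration:

import com.fasterxml.jackson.databind.ObjectMapper;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

public class DocumentPayload {
    public static void main(String[] args) throws Exception {
        // Placeholder path to the document being uploaded
        byte[] bytes = Files.readAllBytes(Paths.get("reports/annual-report.pdf"));

        // Metadata plus the base64-encoded file content in a single JSON body
        Map<String, Object> payload = new LinkedHashMap<>();
        payload.put("companyId", "ACME-123");
        payload.put("format", "pdf");
        payload.put("language", "en");
        payload.put("timestamp", "2016-01-01T00:00:00Z");
        payload.put("content", Base64.getEncoder().encodeToString(bytes));

        String json = new ObjectMapper().writeValueAsString(payload);
        System.out.println(json); // POST this body to the REST endpoint
    }
}

On the server side you would decode the content field back to bytes before writing it to disk or to a blob column.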

Data mining on unstructured text

I am working right now on an academic project and I want to use data mining techniques for a market segmentation.
I want to store text information (which is supposed to be a large amount of text), like tweets, news feeds, etc., so they are different sources of data (with different structures).
There are 2 questions:
What is the best way to get all these news articles, posts, etc., so that I can finally gather enough text data to be able to process it and draw good conclusions from it? Or what other kinds of unstructured data could I use?
Where should I store all the unstructured text in order to access it later and apply all these text mining techniques? What about MongoDB?
Thank you so much!
Take a look at the following:
Apache Lucene
Apache Solr
Elasticsearch
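For instance, a minimal sketch of indexing one piece of text with Lucene; the index path, field names, and the sample document are placeholders:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class IndexTweets {
    public static void main(String[] args) throws Exception {
        // Placeholder directory for the on-disk index
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("text-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            doc.add(new StringField("source", "twitter", Field.Store.YES)); // exact-match field
            doc.add(new TextField("body", "Some tweet text to analyse...", Field.Store.YES)); // full-text field
            writer.addDocument(doc);
        }
    }
}

Solr and Elasticsearch build on the same indexing model but expose it as a server with an HTTP API, which may be a better fit when the data arrives from several different sources.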

Database design: Postgres or EAV to hold semi-structured data

I was given the task to decide whether our stack of technologies is adequate to complete the project we have at hand or should we change it (and to which technologies exactly).
The problem is that I'm just a SQL Server DBA and I have a few days to come up with a solution...
This is what our client wants:
They want a web application to centralize pharmaceutical research, separated into topics, or projects, in their jargon. The research results are sent as CSV files and they are somewhat structured as follows:
Project (just a name for the project)
Segment (could be behavioral, toxicology, etc. There is a finite set of about 10 segments. Each csv file holds a segment)
Mandatory fixed fields (a small set of fields that are always present, like Date, subjects IDs, etc. These will be the PKs).
Dynamic fields (could be anything here, but always as a key/pair value and shouldn't be more than 200 fields)
Whatever files (images, PDFs, etc.) that are associated with the project.
At the moment, they just want to store these files and retrieve them through a simple search mechanism.
They don't want to crunch the numbers at this point.
98% of the files have a couple of thousand lines, but 2% have a couple of million rows (and around 200 fields).
This is what we are developing so far:
The back-end is SQL Server 2008 R2. I've designed EAVs for each segment (before anything, please keep in mind that this is not our first EAV design; it worked well before with less data), and the mid-tier/front-end is PHP 5.3 with the Laravel 4 framework and Bootstrap.
The issue we are experiencing is that PHP chokes on the big files. It can't insert into SQL in a timely fashion when there are more than 100k rows, and that's because there's a lot of pivoting involved and, on top of that, PHP needs to fetch all the field IDs first before it can start inserting.
I'll explain: this is necessary because the client wants some sort of control over the field names. We created a repository of all the possible fields to try to minimize ambiguity problems; fields named, for instance, "Blood Pressure", "BP", "BloodPressure" or "Blood-Pressure" should all be stored under the same name in the database. So, to minimize the issue, the user has to actually insert his CSV fields into another table first; we called it the properties table. This doesn't completely solve the problem, but as he's inserting the fields, he sees possible matches already inserted: when the user types in "blood", a panel shows all the fields already used that contain the word "blood". If the user thinks it's the same thing, he has to change the CSV header to that field.
Anyway, all this is to explain that it's not a simple EAV structure and there's a lot of back and forth of IDs.
This issue is giving us second thoughts about our choice of technology stack, but we have limitations on our possible choices: I have only worked with relational DBs so far (only SQL Server, actually), and the other guys know only PHP. I guess a full MS stack is out of the question.
It seems to me that a non-SQL approach would be best. I've read a lot about MongoDB, but honestly I think it would be a super steep learning curve for us, and if they want to start crunching the numbers or even have some reporting capabilities, I guess Mongo wouldn't be up to that.
I'm reading about PostgreSQL, which is relational, and its famous hstore type. So here is where my questions start:
Do you guys think Postgres would be a better fit than SQL Server for this project?
Would we be able to convert the CSV files into JSON objects (or whatever) to be stored in hstore fields and be somewhat queryable? (See the sketch after this list for the kind of insert I mean.)
Are there any issues with Postgres sitting on a Windows box? I don't think our client has Linux admins. Nor do we, for that matter...
Is its licensing free for commercial applications?
Or should we stick with what we have and try to sort the problem out with staging tables, bulk insert, or some other technique that relies on the back-end to do the heavy lifting?
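To make question 2 concrete, this is roughly the shape of insert I have in mind for the dynamic fields, shown here as a plain JDBC sketch (the table, columns, and hstore literal are made up, and it assumes the hstore extension is enabled in the database):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CsvRowToHstore {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/research", "app", "secret")) {

            // Fixed fields as normal columns, dynamic fields packed into one hstore column
            String sql = "INSERT INTO results (project, segment, rec_date, subject_id, attrs) "
                       + "VALUES (?, ?, ?::date, ?, ?::hstore)";

            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "Project X");
                ps.setString(2, "toxicology");
                ps.setString(3, "2014-05-01");
                ps.setString(4, "SUBJ-001");
                // hstore literal built from the CSV's dynamic key/value pairs
                ps.setString(5, "\"Blood Pressure\"=>\"120/80\", \"Weight\"=>\"70\"");
                ps.executeUpdate();
            }
        }
    }
}

Querying could then use the hstore operators, e.g. WHERE attrs @> '"Blood Pressure"=>"120/80"'::hstore, which a GIN index on the attrs column can support.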
Sorry for the long post and thanks for your input guys, I appreciate all answers as I'm pulling my hair out here :)