Convert unstructured data to structured data? [closed]

How can I convert unstructured data into structured data? For example, extracting email contacts from unstructured text into a structured format.
Are there any algorithms to do this?

There's no generic algorithm to "take unstructured data and convert it to structured data", no. It depends heavily on what the possible range of inputs is, what the desired structure is, what conversions need to be applied, and so on.
The class of problem is called "parsing": you need to construct a parser for the specific inputs you expect, and use that parser to generate structure from what it discovers about the input you get.
Your programming language will likely have parsing libraries available to assist with constructing a specific parser.
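For the email-contacts case specifically, a tiny special-purpose parser can be as simple as a regular expression. A minimal Python sketch (the pattern is an illustrative approximation, not a full RFC 5322 address parser):

    # A rough extractor for email-like substrings in free-form text.
    import re

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def extract_emails(text: str) -> list[str]:
        """Return the email-like substrings found in unstructured text."""
        return EMAIL_RE.findall(text)

    print(extract_emails("Contact Jane <jane.doe@example.com> or bob@example.org."))
    # ['jane.doe@example.com', 'bob@example.org']

For anything richer (names, phone numbers, mixed formats) you would build up a proper parser as described above.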

Related

How can I store massive amounts of text in PostgreSQL? [closed]

I want to store a massive amount of data, specifically an amount of text roughly equivalent to a book. How can I go about this? Is there a type of data storage that is better suited (faster/easier) for this kind of content?
There are limits, but nowhere near what a book requires. A single database can have (with default configuration) over a billion tables, each table can be up to 32 TB in size, and a single text value can be up to about 1 GB, far larger than any book.
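A minimal Python sketch with psycopg2 (the connection string, table, and file names are made up for illustration); an ordinary text column comfortably holds book-sized content:

    # Store a whole book in a plain text column.
    import psycopg2

    conn = psycopg2.connect("dbname=library")  # hypothetical connection string
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS books "
                    "(id serial PRIMARY KEY, title text, body text)")
        with open("war_and_peace.txt", encoding="utf-8") as f:
            cur.execute("INSERT INTO books (title, body) VALUES (%s, %s)",
                        ("War and Peace", f.read()))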

When is it better to pull all the data and then filter, versus pulling the data already filtered? [closed]

I am working with Spark (PySpark) and MongoDB as the database.
We are running into some performance issues, and the answers I found here were not directly related to Big Data.
We pull our entire MongoDB collection and then filter in Spark, and when we apply some filters, some of the columns we don't filter on are still present in the Spark DataFrame (I explain this last case in more detail below).
My questions, besides a general understanding of the question's title:
Pull and then filter, or filter and then pull? If there is no clear answer, what are the parameters to start taking into account?
Let's say I have a Spark DataFrame with columns A, B, and C and I filter only on C; would it be better (assuming I pulled everything) to then drop A and B?
Any links or readings regarding this are welcome.
1 - Pull filtered data: it is more efficient to pull only the data you want, and most databases are optimized for filter operations. The ideal case is when you can partition your data on your filtering columns (column C in your case, I guess).
2 - I am not sure, but I think it's better to drop the columns you don't use, mainly to reduce the shuffle size if you have a shuffle; it also makes your DataFrame clearer.
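A minimal PySpark sketch, assuming the MongoDB Spark connector (v10+ option names) and made-up connection details; filters and selected columns applied on the DataFrame can be pushed down by the connector, so only matching documents and fields are pulled into Spark:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("filter-at-the-source")
             .config("spark.mongodb.read.connection.uri", "mongodb://localhost")
             .getOrCreate())

    df = (spark.read.format("mongodb")
          .option("database", "mydb")      # hypothetical database name
          .option("collection", "events")  # hypothetical collection name
          .load())

    # Filter on C and keep only the columns actually needed; both the
    # predicate and the projection can be pushed down to MongoDB.
    result = df.filter(df["C"] > 100).select("A", "C")
    result.explain()  # inspect the plan to check pushed filters/projection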

Difference between ReduceByKey and CombineByKey in Spark [closed]

Is there any difference between reduceByKey and combineByKey when it comes to performance in Spark? Any help on this is appreciated.
reduceByKey internally calls combineByKey, so the basic way tasks are executed is the same for both.
The reason to choose combineByKey over reduceByKey is when the input type and output type are not expected to be the same; in that case combineByKey has the extra overhead of converting one type to the other.
If no type conversion is involved, there is no difference at all.
See the following links; a small example follows them:
http://bytepadding.com/big-data/spark/reducebykey-vs-combinebykey
http://bytepadding.com/big-data/spark/groupby-vs-reducebykey
http://bytepadding.com/big-data/spark/combine-by-key-to-find-max
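A minimal PySpark sketch of the distinction on toy data (names are my own): reduceByKey keeps the value type, while combineByKey lets the output type differ from the input type:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("reduce-vs-combine").getOrCreate()
    sc = spark.sparkContext

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

    # Same input and output type (int -> int): reduceByKey is enough.
    sums = pairs.reduceByKey(lambda x, y: x + y)

    # Output type differs from input type: per-key (sum, count) for averages.
    sum_counts = pairs.combineByKey(
        lambda v: (v, 1),                         # createCombiner
        lambda acc, v: (acc[0] + v, acc[1] + 1),  # mergeValue
        lambda a, b: (a[0] + b[0], a[1] + b[1]))  # mergeCombiners

    print(sums.collect())  # [('a', 4), ('b', 6)] (order may vary)
    print(sum_counts.mapValues(lambda t: t[0] / t[1]).collect())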

Use cases for hstore vs json datatypes in postgresql [closed]

In Postgresql, the hstore and json datatypes seem to have very similar use cases. When would you choose to use one vs. the other? Initial thoughts:
You can nest with json; you can't with hstore
Functions for parsing json won't be available until 9.3
The json type is just a string. There are no built-in functions to parse it; the only thing gained by using it is validity checking.
Edit, for those downvoting: this was written when 9.3 did not exist yet, and it is correct for 9.2. The question was also different; check the edit history.
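A minimal Python/psycopg2 sketch of the practical difference (the table and column names are made up, and hstore requires the extension to be installed): hstore holds a flat string-to-string map, while json accepts arbitrarily nested documents, stored as validated text on 9.2:

    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect("dbname=test")  # hypothetical connection string
    psycopg2.extras.register_hstore(conn)   # lets Python dicts adapt to hstore

    with conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS docs "
                    "(id serial PRIMARY KEY, attrs hstore, payload json)")
        # hstore: flat key/value pairs, all values are strings
        cur.execute("INSERT INTO docs (attrs) VALUES (%s)",
                    ({"color": "red", "size": "XL"},))
        # json: nesting and mixed value types are fine
        cur.execute("INSERT INTO docs (payload) VALUES (%s)",
                    (psycopg2.extras.Json({"color": "red",
                                           "dims": {"w": 10, "h": 20}}),))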

ADO.NET Performance: Which approach is faster and more reasonable? [closed]

I want to select a certain amount of data from one table. Based on that data, I want to check another two tables and insert into two tables.
So I need to iterate over the resulting data. Which way is better (faster) and more reasonable to use: DataReader or DataTable?
Thanks in advance
RedsDevils
You end up creating a reader to fill the table; the reverse isn't true, so I would stick with the DataReader.
-Josh