Producing ngram frequencies for a large dataset


I'd like to generate ngram frequencies for a large dataset. Wikipedia, or more specifically, Freebase's WEX is suitable for my purposes.
What's the best and most cost efficient way to do it in the next day or so?
My thoughts are:
PostgreSQL using regex to split sentences and words. I already have the WEX dump in PostgreSQL, and I already have regex to do the splitting (major accuracy isn't required here)
MapReduce with Hadoop
MapReduce with Amazon's Elastic MapReduce, which I know next to nothing about
My experience with Hadoop consists of calculating Pi on three EC2 instances very very inefficiently. I'm good with Java, and I understand the concept of Map + Reduce.
PostgreSQL I fear will take a long, long time, as it's not easily parallelisable.
Any other ways to do it? What's my best bet for getting it done in the next couple days?

MapReduce will work just fine, and you could probably do most of the input-output shuffling with Pig.
See
http://arxiv.org/abs/1207.4371
for some algorithms.
Of course, to get a running start you don't actually need MapReduce for this task: just split the input yourself, write the simplest fast program that calculates the ngrams of a single input file, and aggregate the ngram frequencies afterwards.
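The split-then-aggregate approach can be sketched in a few lines of Python. This is only a minimal illustration: the regex tokenizer, the function names and the sample chunks are assumptions made for the example, not anything taken from the WEX dump.

```python
import re
from collections import Counter


def ngram_counts(text, n=2):
    """Count word n-grams in one chunk of text (crude regex tokenizer)."""
    counts = Counter()
    # Split into rough sentences so n-grams do not span sentence boundaries.
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"\w+", sentence.lower())
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts


def merge_counts(partials):
    """Aggregate the per-chunk counters into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical input: each string stands in for one split of the dump.
chunks = ["The cat sat. The cat ran.", "A dog sat. The cat sat."]
totals = merge_counts(ngram_counts(chunk) for chunk in chunks)
```

Each chunk could be processed by a separate process or machine, since `ngram_counts` only looks at its own input; only the final merge needs to see all partial results.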

Hadoop gives you two good things, which are the main ones in my opinion: parallel task running (map-only jobs) and distributed sort (shuffling between map and reduce).
For the ngrams, it looks like you need both: parallel tasks (mappers) to emit ngrams, and shuffling to count the occurrences of each ngram.
So I think Hadoop is an ideal solution here.


How to split a large data frame and use the smaller parts to do multiple broadcast joins in Spark?

Let's say we have two very large data frames, A and B. I understand that if I use the same hash partitioner for both RDDs and then do the join, the keys will be co-located and the join might be faster thanks to reduced shuffling (the only shuffling that happens will be when the partitioner changes on A and B).
I wanted to try something different though: a broadcast join. Let's say B is smaller than A, so we pick B to broadcast, but B is still a very big data frame. What we want to do is make multiple data frames out of B and then broadcast each one to be joined with A.
Has anyone tried this?
To split one data frame into many, the only thing I see is the randomSplit method, but that doesn't look like a great option.
Is there a better way to accomplish this task?
Thanks!

"Has anyone tried this?"
Yes, someone already tried that. In particular GoDataDriven. You can find details below:
presentation - https://databricks.com/session/working-skewed-data-iterative-broadcast
code - https://github.com/godatadriven/iterative-broadcast-join
They claim pretty good results for skewed data. However, there are three problems to consider if you do this yourself:
There is no split in Spark. You have to filter the data multiple times or eagerly cache complete partitions (How do I split an RDD into two or more RDDs?) to imitate "splitting".
The huge advantage of a broadcast is the reduction in the amount of transferred data. If the data is large, the amount of data to be transferred can actually increase significantly (Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark).
Each "join" increases the complexity of the execution plan, and with a long series of transformations things can get really slow on the driver side.
"randomSplit method but that doesn't look so great an option."
It is actually not a bad one.
"Any other better way to accomplish this task?"
You may try to filter by partition id.
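To make the idea concrete, here is a plain-Python simulation of joining against the smaller side one slice at a time. It is not Spark code and not the GoDataDriven implementation: the bucket function, the names and the tuple-based rows are assumptions made for illustration. In Spark the slice test would be a filter instead, for example on spark_partition_id().

```python
from zlib import crc32


def split_by_bucket(rows, num_parts):
    """Deterministically split (key, value) rows into num_parts slices by key hash."""
    parts = [[] for _ in range(num_parts)]
    for key, value in rows:
        parts[crc32(str(key).encode()) % num_parts].append((key, value))
    return parts


def iterative_broadcast_join(large, small, num_parts):
    """Inner-join `large` with `small`, holding only one slice of `small` at a time."""
    joined = []
    for part in split_by_bucket(small, num_parts):
        # Stand-in for the broadcast: an in-memory lookup table for this slice.
        lookup = {}
        for key, value in part:
            lookup.setdefault(key, []).append(value)
        # Every slice needs its own pass over the large side.
        for key, left in large:
            for right in lookup.get(key, ()):
                joined.append((key, left, right))
    return joined
```

Note that the large side is scanned once per slice, which is the plan-complexity and data-transfer cost the list above warns about.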

How to do parallel pipeline?

I have built a Scala application in Spark v1.6.0 that combines various functionalities. I have code for scanning a dataframe for certain entries, code that performs certain computations on a dataframe, code for creating an output, etc.
At the moment the components are 'statically' combined, i.e., in my code I call the code from a component X doing a computation, I take the resulting data and call a method of component Y that takes the data as input.
I would like to get this more flexible, having a user simply specify a pipeline (possibly one with parallel executions). I would assume that the workflows are rather small and simple, as in the following picture:
However, I do not know how to best approach this problem.
I could build the whole pipeline logic myself, which will probably result in quite some work and possibly some errors too...
I have seen that Apache Spark comes with a Pipeline class in the ML package. However, it does not support parallel execution, if I understand correctly (in the example, the two ParquetReaders could read and process the data at the same time).
There is apparently the Luigi project, which might do exactly this. However, its page says that Luigi is for long-running workflows, whereas I just need short-running workflows, so Luigi might be overkill.
What would you suggest for building work/dataflows in Spark?

I would suggest using Spark's MLlib pipeline functionality; what you describe sounds like it would fit the case well. One nice thing about it is that it allows Spark to optimize the flow for you, in a way that is probably smarter than what you would do by hand.
You mention it can't read the two Parquet files in parallel, but it can read each separate file in a distributed way. So rather than having N/2 nodes process each file separately, you would have N nodes process them in series, which I'd expect to give you a similar runtime, especially if the mapping to y-c is 1-to-1. Basically, you don't have to worry about Spark underutilizing your resources (if your data is partitioned properly).
But actually things may even be better, because Spark is smarter at optimising the flow than you are. An important thing to keep in mind is that Spark may not do things exactly in the way and in the separate steps as you define them: when you tell it to compute y-c it doesn't actually do that right away. It is lazy (in a good way!) and waits until you've built up the whole flow and ask it for answers, at which point it analyses the flow, applies optimisations (e.g. one possibility is that it can figure out it doesn't have to read and process a large chunk of one or both of the Parquet files, especially with partition discovery), and only then executes the final plan.

Indexing large text files line by line for fast access

I have a very large text file, around 43GB, which I process to generate other files in different forms. I don't want to set up any databases or indexing search engines.
The data is in the .ttl format:
<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .
<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lad.dbpedia.org/resource/Mohandas_Gandhi> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lb.dbpedia.org/resource/Mohandas_Karamchand_Gandhi> .
The target is to generate all combinations from all triples that share the same subject.
For example, for the subject Q1000:
<http://nl.dbpedia.org/resource/Gabon> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .
<http://en.dbpedia.org/resource/Gabon> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .
The problem:
The dummy code to start with iterates with complexity O(n^2), where n is the number of lines in the 45GB text file. Needless to say, that would take years.
What I thought of to optimize:
Loading a HashMap[String, IntArray] that indexes the line numbers where each key appears, and using some library to access the file by line number, for example:
Q1000 | 1,2,433
Q1001 | 2334,323,2124
The drawback is that the index could be relatively large as well, considering that we would need another index for access by line number, plus the overloaded I didn't try the performance of the
Making a text file for each key, like Q1000.txt for all triples that contain the subject Q1000, then iterating over the files one by one and making the combinations.
Drawbacks: this seems the fastest and least memory-consuming option, but creating around 10 million files and accessing them will certainly be a problem. Is there an alternative to that?
I'm using Scala scripts for the task.

Take the 43GB file in chunks that fit comfortably in memory and sort on the subject. Write the chunks separately.
Run a merge sort on the chunks (sorted by subject). It's really easy: you take iterators over two files as input, write out whichever current line is smaller, then read the next line from that same file (if there is any left).
Now you just need to make one pass through the sorted data to gather the groups of subjects.
Should take O(n) space and O(n log n) time, which for this sort of thing you should be able to afford.
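The three steps above (sort chunks, merge, one grouping pass) could look roughly like the following Python sketch. It merges all sorted runs at once with heapq.merge rather than two files at a time, and it assumes well-formed lines of the form "subject predicate object ." with no blank lines; the function names and the default chunk size are made up for the example.

```python
import heapq
import itertools
import tempfile

SAME_AS = "<http://www.w3.org/2002/07/owl#sameAs>"


def subject_of(line):
    """Sort/group key: the first whitespace-separated field of a triple."""
    return line.split(" ", 1)[0]


def sorted_runs(lines, max_lines):
    """Sort fixed-size chunks in memory and spill each one to a temp file."""
    lines = iter(lines)
    runs = []
    while True:
        chunk = sorted(itertools.islice(lines, max_lines), key=subject_of)
        if not chunk:
            return runs
        run = tempfile.TemporaryFile("w+")
        run.writelines(line.rstrip("\n") + "\n" for line in chunk)
        run.seek(0)  # rewind so the merge step can read the run back
        runs.append(run)


def same_as_pairs(lines, max_lines=100_000):
    """Yield a sameAs triple for every ordered pair of objects sharing a subject."""
    merged = heapq.merge(*sorted_runs(lines, max_lines), key=subject_of)
    for _, group in itertools.groupby(merged, key=subject_of):
        objects = [line.split()[2] for line in group]
        for a, b in itertools.permutations(objects, 2):
            yield f"{a} {SAME_AS} {b} ."
```

For the real file, `lines` would simply be the open file object, and `max_lines` would be tuned so that one chunk fits comfortably in memory.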

A possible solution would be to use some existing map-reduce library. After all, your task is exactly what map-reduce is for. Even if you don't parallelize your computation on multiple machines, the main advantage is that it handles the management of splitting and merging for you.
There is an interesting library, Apache Crunch, with a Scala API. I haven't used it myself, but it looks like it could solve your problem well. Your lines would be split according to their subjects and then

Learning mapreduce, How is a SQL command translated in map reduce data flow. Word count example is not satisfying my understanding.

Online, I see so many examples of the canonical word count MapReduce walkthrough. I understand mapper input of k,v => reducer input of k,list(v). Some magic goes on inside MapReduce. I don't quite understand how to apply MapReduce to a more practical example. For example: let's say I have a file containing the salaries of all employees in the US, with some other details such as state and city. How would MapReduce work to provide an output report containing the following aggregated columns? State, city, avg(salaries)
In SQL I can get this with a query like this:
SELECT state, city, avg(salaries)
FROM employee_tbl
GROUP BY state, city
How will MapReduce work to give me the above result set? I have used Hive, but I don't know how that SQL gets translated to map and reduce.

A simple way to convert your SQL query into a map-reduce job would be to use Hive over Hadoop.
But in case you don't want that, a simple rule of thumb that applies in most cases when emulating an SQL query as a map-reduce job is:
Key-out in the Map function is the columns in your GROUP BY clause.
In your example, let state-city be the key that you output in your Map function (use some separator between them).
Value-out in your Map function is the column on which you want to run the aggregation function.
In your example it would be the individual salary (if there is more than one column that you want to aggregate, you can separate them with the same separator).
Key-in in Reduce will be the same as key-out of the Map function.
Value-out in the Reduce function will be the value after running the aggregation function over the value-out of all rows that have the same key.
So in this case you sum up all the value-in entries (salaries) for each key, and value-out is the sum of salaries for a unique 'state-city' pair. Since the query asks for avg rather than sum, also count the values and divide the sum by that count.
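The rule of thumb can be tried out without Hadoop. Below is a small in-memory Python sketch of the three phases for the GROUP BY state, city / avg query. The dict-based rows, the `salary` field name and the function names are assumptions for the example; the `shuffle` function stands in for the grouping that the framework performs between map and reduce.

```python
from collections import defaultdict


def map_row(row):
    """Map: key-out is the GROUP BY columns, value-out is the column to aggregate."""
    return (row["state"], row["city"]), row["salary"]


def shuffle(pairs):
    """Shuffle: collect all values that share a key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_avg(key, values):
    """Reduce: run the aggregation (here avg = sum / count) over one key's values."""
    return key, sum(values) / len(values)


def average_salaries(rows):
    groups = shuffle(map_row(row) for row in rows)
    return dict(reduce_avg(key, values) for key, values in groups.items())
```

Swapping `reduce_avg` for a function that returns `sum(values)` or `max(values)` gives the other SQL aggregates with the same map and shuffle steps.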

If you want to directly translate a SQL query into a set of Map/Reduce jobs, you should definitely take a look at YSmart. It is a SQL-to-Map/Reduce translator built on top of Hadoop. Some studies have also shown it might be faster than Hive, although I can't back this claim as I haven't tested it myself.
As taken from their docs, YSmart provides:
High Performance: The MapReduce programs generated by YSmart are optimized. YSmart can automatically detect and utilize intra-query correlations when translating a query. This correlation-aware ability significantly reduces redundant computation, unnecessary disk IO operations and network overhead. See the Performance page to learn the performance benefits of YSmart.
High Extensibility: YSmart is easy to modify and extend. It is designed with the goal of extensibility. The major part of YSmart is implemented in Python which makes the codes much easier to understand. Due to its modularity and script nature, users can easily modify the current functionalities or add new functionalities to YSmart.
High Flexibility: YSmart can run in two different modes: translation-mode and execution-mode. In the translation-mode, YSmart only translates the query into Java codes while in the execution-mode YSmart will also compile and execute the generated codes. Because of this flexibility, users can easily read, modify and customize the generated codes.

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can also see that hadoop/hdfs has latency. I've read that it's generally not used for real-time querying (even though I'm ok with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!

Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot arrive within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
Check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of AB (datasets A and B taken together) by doing
sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] is (sum(A^2) + sum(B^2)) / (|A| + |B|) and E[AB] is (sum(A) + sum(B)) / (|A| + |B|).
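As a sketch of how that identity allows pre-computation: store a (count, sum, sum of squares) triple per dataset, add the triples for whichever datasets the user selects, and derive the mean and stddev at the end. The function names below are made up for the example, and the result is the population standard deviation. This formula can lose precision when the mean is large relative to the spread, so treat it as an illustration rather than a numerically robust implementation.

```python
import math


def summarize(values):
    """Pre-computable summary of one dataset column: (count, sum, sum of squares)."""
    return len(values), sum(values), sum(v * v for v in values)


def combine(a, b):
    """Summaries add component-wise, so any selection of datasets can be merged."""
    return tuple(x + y for x, y in zip(a, b))


def mean_and_stddev(summary):
    """Population mean and stddev via stddev = sqrt(E[X^2] - (E[X])^2)."""
    count, total, total_sq = summary
    mean = total / count
    # Clamp at zero to guard against tiny negative values from rounding.
    variance = max(total_sq / count - mean * mean, 0.0)
    return mean, math.sqrt(variance)
```

For example, `mean_and_stddev(combine(summarize(a), summarize(b)))` gives the statistics of A and B together without touching the raw values again.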

Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.

It is a serious problem without an immediate good solution in the open source space. In the commercial space, MPP databases like Greenplum or Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory, which should give you an idea of what size of cluster you need.

If I understand you correctly and you only need to aggregate on a single column at a time, you can store your data differently for better results.
In HBase that would look something like:
a table per data column in today's setup, and another single table for the filtering fields (type_ids)
a row for each key in today's setup - you may want to think about how to incorporate your filter fields into the key for efficient filtering - otherwise you'd have to do a two phase read (
a column for each table in today's setup (i.e. a few thousand columns)
HBase doesn't mind if you add new columns, and it is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant values, on which you can compute avg etc. quite easily.

You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since it doesn't sound like your data set needs joins, you should be fine. You can set up indexes to find the data set and then do the math either in SQL or in the app.
