How can I store Theta Sketches (Yahoo's DataSketches) in a SnappyData table instead of writing them to files? I generate billions of sketches every day and need to keep many millions of them online for real-time queries. Can anyone help me? Thanks.
Can't you store these in a column with a BLOB data type?
If you are writing out from a Spark program and managing the sketches in a DataFrame, I would think df.write.format("column").saveAsTable should work.
Otherwise, serialize the sketch (sketch.compact().toByteArray()) and store it in a BLOB column using SQL.
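A minimal sketch of the serialize-and-insert route, assuming the com.yahoo.sketches Theta library and a hypothetical table created as sketches(id VARCHAR, sketch BLOB); the JDBC URL below uses SnappyData's default client port, so verify it against your cluster:

import java.sql.DriverManager
import com.yahoo.sketches.theta.UpdateSketch

val sketch = UpdateSketch.builder().build()
(1L to 1000000L).foreach(i => sketch.update(i))      // populate with your keys
val bytes = sketch.compact().toByteArray             // compact, immutable wire form

val conn = DriverManager.getConnection("jdbc:snappydata://localhost:1527/")
val ps = conn.prepareStatement("INSERT INTO sketches (id, sketch) VALUES (?, ?)")
ps.setString(1, "daily-segment-001")                 // hypothetical sketch id
ps.setBytes(2, bytes)
ps.executeUpdate()

// To read one back: Sketches.wrapSketch(Memory.wrap(rs.getBytes("sketch")))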
I'm trying to write a dataframe to S3 from EMR-Spark and I'm seeing some really slow write times, where writing comes to dominate the total runtime (~80%) of the script. For what it's worth, I've tried both .csv and .parquet formats; it doesn't seem to make a difference.
My data can be formatted in two ways, here's the preferred format:
ID : StringType | ArrayOfIDs : ArrayType
(The number of unique IDs in the first column is in the low millions. ArrayOfIDs contains GUID-formatted strings and can contain anywhere from ~100 to 100,000 elements.)
Writing the first form to S3 is incredibly slow. For what it's worth, I've tried setting mapreduce.fileoutputcommitter.algorithm.version to 2 as described here (https://issues.apache.org/jira/browse/SPARK-20107), to no real effect.
However my data can also be formatted as an adjacency list, like this:
ID1 : StringType | ID2 : StringType
This appears to be much faster for writing to S3, but I am at a loss for why. Here are my specific questions:
Ultimately I'm trying to get my data into an Aurora RDS Postgres cluster (I was told firmly by those before me that the Spark JDBC connector is too slow for the job, which is why I'm currently trying to dump the data in S3 before loading it into Postgres with a COPY command). I'm not married to using S3 as an intermediate store if there are better alternatives for getting these data frames into RDS Postgres.
I don't know why the first schema with the array of strings is so much slower on write. The total data written is actually far less than with the second schema, on account of eliminating ID duplication from the first column. It would also be nice to understand this behavior.
Well, I still don't know why writing arrays directly from Spark is so much slower than the adjacency list format. But best practice seems to dictate that I avoid writing to S3 directly from Spark.
Here's what I'm doing now:
1. Write the data to HDFS (anecdotally, the write speed of the adjacency list vs. the array now falls in line with my expectations).
2. From HDFS, use EMR's s3-dist-cp utility to wholesale write the data to S3 (this also seems reasonably performant with array-typed data).
3. Bring the data into Aurora Postgres with the aws_s3.table_import_from_s3 extension.
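A minimal sketch of those three steps, with hypothetical paths, bucket, and table names (steps 2 and 3 run outside Spark, so they appear as comments):

// Step 1: write from Spark to HDFS rather than directly to S3
df.write.mode("overwrite").csv("hdfs:///tmp/edges")

// Step 2 (shell on the EMR master): bulk-copy the HDFS output to S3
//   s3-dist-cp --src hdfs:///tmp/edges --dest s3://my-bucket/edges
// Step 3 (psql, with the aws_s3 extension installed in Aurora):
//   SELECT aws_s3.table_import_from_s3('edges', 'id1,id2', '(format csv)',
//                                      'my-bucket', 'edges/part-00000.csv', 'us-east-1');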
When referring to the Spark ml/mllib docs, they all start from an example stored in LIBSVM format. This is really frustrating, since there doesn't seem to be a straightforward way to go from a standard RDD[Row] or DataFrame (taken from a "table" select) to this notation without first storing it.
This is just an inconvenience when dealing with 3 features or so, but when you scale that up to lots and lots of features, it implies a lot of typing and searching.
I ended up with something like this (where "train" is a random split of a dataset with features stored in a table):
val trainLp = train.map(row => LabeledPoint(
  row.getInt(0).toDouble,
  Vectors.dense(
    row(8).asInstanceOf[Int].toDouble, row(9).asInstanceOf[Int].toDouble,
    row(10).asInstanceOf[Int].toDouble, row(11).asInstanceOf[Int].toDouble,
    row(12).asInstanceOf[Int].toDouble, row(13).asInstanceOf[Int].toDouble,
    row(14).asInstanceOf[Int].toDouble, row(15).asInstanceOf[Int].toDouble,
    row(18).asInstanceOf[Int].toDouble, row(21).asInstanceOf[Int].toDouble,
    row(27).asInstanceOf[Int].toDouble, row(28).asInstanceOf[Int].toDouble,
    row(29).asInstanceOf[Int].toDouble, row(30).asInstanceOf[Int].toDouble,
    row(31).asInstanceOf[Double], row(32).asInstanceOf[Double],
    row(33).asInstanceOf[Double], row(34).asInstanceOf[Double],
    row(35).asInstanceOf[Double], row(36).asInstanceOf[Double],
    row(37).asInstanceOf[Double], row(38).asInstanceOf[Double],
    row(39).asInstanceOf[Double], row(40).asInstanceOf[Double],
    row(41).asInstanceOf[Double], row(42).asInstanceOf[Double],
    row(43).asInstanceOf[Double])))
This is a nightmare to maintain, since these rows tend to change pretty often.
And here I'm only at the stage of getting labeled points; I'm not even at a LIBSVM-format version of this data.
What am I missing here that could potentially save me days of misery?
EDIT:
I got one step closer to the solution using something called VectorAssembler to build up my vector.
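For reference, a minimal sketch of that approach, with hypothetical column names (VectorAssembler lives in spark.ml and produces a DataFrame with a vector column, which spark.ml estimators consume directly):

import org.apache.spark.ml.feature.VectorAssembler

// list the feature columns once, in one maintainable place
val assembler = new VectorAssembler()
  .setInputCols(Array("my_col_1", "my_col_2", "my_col_3"))
  .setOutputCol("features")

val withFeatures = assembler.transform(train)   // adds a "features" vector column
// withFeatures.select("label", "features") can feed spark.ml estimators directly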
Usually, CSV files are raw, unfiltered sources of information; often they are the original source.
In order to build a model, you usually want to go through a data cleansing, data preparation, data wrangling (and maybe more "data x" wording) phase first. This phase usually takes up a large share of the model building, and it usually requires exploration of the data. Typically, a process of transformation and feature selection (and creation) occurs between the original data and the data that builds the model.
If your CSV files don't need any of these preliminary phases - good for you!
You can always make configuration files that keep track of the columns or column indexes that build your model.
If your DataFrame comes from a "select", I guess what you can do to improve legibility and maintainability is to use column names instead of index numbers.
df.select($"my_col_1", $"my_col_2", .. )
and then operate through
row.getAs[String]("my_col_1")
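Putting the two together, a minimal sketch with hypothetical column names (going through .rdd so no Dataset encoder is needed):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val trainLp = train
  .select("label", "my_col_1", "my_col_2")      // hypothetical column names
  .rdd                                          // RDD[Row]
  .map { row =>
    LabeledPoint(
      row.getAs[Int]("label").toDouble,
      Vectors.dense(
        row.getAs[Int]("my_col_1").toDouble,    // lookups by name survive reordering
        row.getAs[Double]("my_col_2")))
  }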
This might sound like a stupid question.
I can write MR code that takes its input from and writes its output to HDFS locations, and then I really don't need to worry about the parallel computing power of Hadoop/MR (please correct me if I am wrong here).
However, if my input is not an HDFS location, say I am taking MongoDB data as input (mongodb://localhost:27017/mongo_hadoop.messages), running my mappers and reducers, and storing the data back to MongoDB, how will HDFS come into the picture? I mean, how can I be sure that a 1 GB (or any sized) big file is first being distributed on HDFS and then computed on in parallel?
Is it that this direct URI will not distribute the data, so I need to take the BSON file instead, load it up on HDFS, and then give the HDFS path as input to MR? Or is the framework smart enough to do this by itself?
I am sorry if the above question is too stupid or doesn't make any sense at all. I am really new to big data but very much excited to dive into this domain.
Thanks.
You are describing DBInputFormat. This is an input format that reads its splits from an external database. HDFS only gets involved in setting up the job, not in the actual input. There is also a DBOutputFormat. With an input format like DBInputFormat, the splits are logical, e.g. key ranges.
Read Database Access with Apache Hadoop for a detailed explanation.
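For illustration, a minimal sketch of wiring DBInputFormat into a job, assuming a JDBC-accessible table messages(id, text); the record class, table, and connection details here are hypothetical:

import java.io.{DataInput, DataOutput}
import java.sql.{PreparedStatement, ResultSet}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.db.{DBConfiguration, DBInputFormat, DBWritable}

// one record of the external table; Hadoop materializes these from the splits
class MessageRecord extends Writable with DBWritable {
  var id: Long = 0L
  var text: String = ""
  def readFields(rs: ResultSet): Unit = { id = rs.getLong(1); text = rs.getString(2) }
  def write(ps: PreparedStatement): Unit = { ps.setLong(1, id); ps.setString(2, text) }
  def readFields(in: DataInput): Unit = { id = in.readLong(); text = in.readUTF() }
  def write(out: DataOutput): Unit = { out.writeLong(id); out.writeUTF(text) }
}

val conf = new Configuration()
DBConfiguration.configureDB(conf, "org.postgresql.Driver",
  "jdbc:postgresql://localhost/mydb", "user", "password")
val job = Job.getInstance(conf, "db-input-example")
job.setInputFormatClass(classOf[DBInputFormat[MessageRecord]])
// splits are logical key ranges over "messages", ordered by "id"
DBInputFormat.setInput(job, classOf[MessageRecord], "messages", null, "id", "id", "text")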
Sorry, I am not sure about MongoDB.
If you just want to know how splitting happens when the data source is a table, then here is my answer for MapReduce working with HBase.
We use TableInputFormat to read an HBase table in a MapReduce job.
From http://hbase.apache.org/book.html#hbase.mapreduce.classpath:
7.7. Map-Task Splitting
7.7.1. The Default HBase MapReduce Splitter
When TableInputFormat is used to source an HBase table in a MapReduce job, its splitter will make a map task for each region of the table. Thus, if there are 100 regions in the table, there will be 100 map-tasks for the job - regardless of how many column families are selected in the Scan.
7.7.2. Custom Splitters
For those interested in implementing custom splitters, see the method getSplits in TableInputFormatBase. That is where the logic for map-task assignment resides.
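For context, a minimal sketch of hooking a mapper up to an HBase table via TableMapReduceUtil (the table and mapper names are hypothetical); as described above, this produces one map task per region:

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableMapReduceUtil, TableMapper}
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.mapreduce.Job

class MyTableMapper extends TableMapper[ImmutableBytesWritable, IntWritable] {
  // map(rowKey, result, context) logic goes here
}

val job = Job.getInstance()
val scan = new Scan()
scan.setCaching(500)        // rows fetched per RPC; tune for scan throughput
scan.setCacheBlocks(false)  // don't pollute the region server block cache from MR

TableMapReduceUtil.initTableMapperJob(
  "my_table",                          // hypothetical source table
  scan,
  classOf[MyTableMapper],
  classOf[ImmutableBytesWritable],     // mapper output key
  classOf[IntWritable],                // mapper output value
  job)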
This is a good question, not stupid.
1.
"mongodb://localhost:27017/mongo_hadoop.messages and running my mappers and reducers and storing the data back to mongodb, how will HDFS come into picture. "
Under this situation, u needn't consider hdfs. U needn't do anything related with hdf. Just like write a multiple-thread application with each thread write data to mongodb.
In fact, hdfs is independent to map-reduce, and map-reduce is also independent to hdfs. So, u can use them separately or together as your wish.
2.
If you want to use a database as input/output for MapReduce, you should consider DBInputFormat, but that's another question.
Currently, Hadoop's DBInputFormat only supports JDBC. I'm not sure whether there is a MongoDB version of DBInputFormat. Maybe you can search for one or implement it yourself.
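One option to search for is the mongo-hadoop connector, which provides Mongo-specific input/output formats so splits come from MongoDB rather than HDFS. A minimal sketch, assuming that project's MongoInputFormat/MongoOutputFormat classes and mongo.input.uri/mongo.output.uri settings (verify the exact names against its docs):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import com.mongodb.hadoop.{MongoInputFormat, MongoOutputFormat}

val conf = new Configuration()
conf.set("mongo.input.uri", "mongodb://localhost:27017/mongo_hadoop.messages")
conf.set("mongo.output.uri", "mongodb://localhost:27017/mongo_hadoop.results")

val job = Job.getInstance(conf, "mongo-mr-example")
job.setInputFormatClass(classOf[MongoInputFormat])           // splits come from Mongo
job.setOutputFormatClass(classOf[MongoOutputFormat[_, _]])   // results written to Mongo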
I am new to HBase. Here is my problem.
I have a very large HBase table. Here is an example row from the table:
1003:15:Species1:MONTH:01 0.1,02 0.7,03 0.3,04 0.1,05 0.1,06 0,07 0,08 0,09 0.1,10 0.2,11 0.3,12 0.1:LATITUDE 26.664503840000002 29.145674380000003,LONGITUDE -96.27139215 -90.40762858
As you can see, for each species there is a month attribute (12 vectors), lat & long, etc. There are around 300 unique species and several thousand observations for one particular species.
I have written a MapReduce job which does K-means clustering on one particular species. The output of my MR is:
C1:1003:15:Species1:MONTH:01 0.1,02 0.7,03 0.3,04 0.1,05 0.1,06 0,07 0,08 0,09 0.1,10 0.2,11 0.3,12 0.1:LATITUDE 26.664503840000002 29.145674380000003,LONGITUDE -96.27139215 -90.40762858
The C1 indicates which cluster it belongs to.
Now I want to visualize the output i.e plot all the Lat and Long for each cluster on a Map. I was thinking of using Mapbox.js and D3.js for my data visualization, since the Lat and Longs in the data are bounding boxes for a particular region.
If I write the output of my MR back to HBase, is it possible to retrieve the data using JavaScript on the client side?
I was thinking of either writing the data to MongoDB, which I can query using JS, or writing a program to create JSON from the HBase table, which I can visualize. Any suggestions?
You can use the HBase REST API, though security-wise it is probably safer to put your own service in the middle.
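For illustration, a minimal sketch of fetching one row through the REST gateway (hypothetical host, table, and row key; the gateway returns cell values base64-encoded when asked for JSON):

import java.net.{HttpURLConnection, URL}
import scala.io.Source

val url = new URL("http://localhost:8080/species_clusters/C1:1003:15:Species1")
val conn = url.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestProperty("Accept", "application/json")  // ask for JSON instead of XML
val json = Source.fromInputStream(conn.getInputStream).mkString
println(json)  // cell values arrive base64-encoded; decode before plotting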
You can also use node-hbase-client from https://github.com/alibaba/node-hbase-client to read the HBase data.
You can also use hbase-rpc-client (https://github.com/falsecz/hbase-rpc-client) to read data from Node.js. This client supports HBase 0.96+.
I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns hold numerical values that we need to aggregate (avg, stddev, plus statistics calculated in R)
a few columns are "type_id" columns, since in a particular query we may want to include only certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab-separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof of concept, I can see this will scale nicely, but I can also see that Hadoop/HDFS has latency; I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive and Pig don't seem like they would help you. Essentially, each of them compiles down to one or more MapReduce jobs, so the response cannot come back within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2]- (E[X])^2)
this implies that you can get the stddev of the combined dataset A∪B by doing
sqrt(E[(A∪B)^2] - (E[A∪B])^2), where E[(A∪B)^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and, likewise, E[A∪B] = (sum(A) + sum(B)) / (|A| + |B|). So keeping a count, sum, and sum of squares per dataset is enough to merge aggregates.
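A minimal sketch of that merge (names are illustrative): precompute (count, sum, sumSq) per dataset and column, then combine only the datasets the user selected at query time, since the merge is associative:

case class Moments(n: Long, sum: Double, sumSq: Double) {
  def merge(o: Moments): Moments = Moments(n + o.n, sum + o.sum, sumSq + o.sumSq)
  def mean: Double = sum / n
  def stddev: Double = math.sqrt(sumSq / n - mean * mean)  // population stddev
}

// per-dataset partials, computed once when a dataset is loaded/replaced
val a = Moments(4, 10.0, 30.0)
val b = Moments(6, 24.0, 120.0)

// query time: merge the partials of the selected datasets
val combined = a.merge(b)
println(f"avg=${combined.mean}%.3f stddev=${combined.stddev}%.3f")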
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory; that should give you an idea of what size of cluster you need.
If I understand you correctly, and you only need to aggregate on single columns at a time, you can store your data differently for better results.
In HBase that would look something like:
a table per data column in today's setup, and another single table for the filtering fields (type_ids)
a row for each key in today's setup; you may want to think about how to incorporate your filter fields into the key for efficient filtering, since otherwise you'd have to do a two-phase read
a column for each dataset (each table in today's setup), i.e. a few thousand columns
HBase doesn't mind if you add new columns, and it is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row, you get all the relevant values, over which you can compute avg etc. quite easily, as in the sketch below.
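A minimal sketch of that read path against the classic HBase client API, with hypothetical table and key names (one HBase table per original data column; each cell qualifier is a dataset id):

import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration}
import org.apache.hadoop.hbase.client.{Get, HTable}
import org.apache.hadoop.hbase.util.Bytes

val table = new HTable(HBaseConfiguration.create(), "metric_col_7")  // hypothetical
val result = table.get(new Get(Bytes.toBytes("key-000042")))         // one row = one key

// each cell is one dataset's value for this key; aggregate only the chosen datasets
val selected = Set("dataset_01", "dataset_17")
val values = result.rawCells()
  .filter(c => selected.contains(Bytes.toString(CellUtil.cloneQualifier(c))))
  .map(c => Bytes.toDouble(CellUtil.cloneValue(c)))
val avg = values.sum / values.length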
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since it doesn't sound like you need to join, you should be fine. You can set up indexes to find the dataset and then do the math either in SQL or in the app.