spark: convert dataframe to svm labeled point - scala

When I look at the Spark ml/mllib docs, they all start from an example already stored in LIBSVM format. This is really frustrating, since there doesn't seem to be a straightforward way to go from a standard RDD[Row] or DataFrame (taken from a "table" select) to that representation without storing it first.
This is just an inconvenience when dealing with 3 features or so, but when you scale up to lots and lots of features, it implies a lot of typing and searching.
I ended up with something like this: (where "train" is a random split of a dataset w/ features stored in a table)
val trainLp = train.map(row => LabeledPoint(row.getInt(0).toDouble, Vectors.dense(row(8).asInstanceOf[Int].toDouble,row(9).asInstanceOf[Int].toDouble,row(10).asInstanceOf[Int].toDouble,row(11).asInstanceOf[Int].toDouble,row(12).asInstanceOf[Int].toDouble,row(13).asInstanceOf[Int].toDouble,row(14).asInstanceOf[Int].toDouble,row(15).asInstanceOf[Int].toDouble,row(18).asInstanceOf[Int].toDouble,row(21).asInstanceOf[Int].toDouble,row(27).asInstanceOf[Int].toDouble,row(28).asInstanceOf[Int].toDouble,row(29).asInstanceOf[Int].toDouble,row(30).asInstanceOf[Int].toDouble,row(31).asInstanceOf[Double],row(32).asInstanceOf[Double],row(33).asInstanceOf[Double],row(34).asInstanceOf[Double],row(35).asInstanceOf[Double],row(36).asInstanceOf[Double],row(37).asInstanceOf[Double],row(38).asInstanceOf[Double],row(39).asInstanceOf[Double],row(40).asInstanceOf[Double],row(41).asInstanceOf[Double],row(42).asInstanceOf[Double],row(43).asInstanceOf[Double])))
This is a nightmare to maintain, since these rows tend to change pretty often.
And here I'm only at the stage of getting labeled points; I'm not even at a LIBSVM-format version of this data.
What am I missing here that could potentially save me days of misery?
EDIT:
I got one step closer to the solution by using something called a VectorAssembler to build up my feature vector.
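For reference, a minimal sketch of how that might look (this assumes Spark 1.x, where the ml VectorAssembler still produces mllib vectors; on Spark 2.x+ you would use org.apache.spark.ml.linalg.Vector and org.apache.spark.ml.feature.LabeledPoint instead; the column names "col_a", "col_b", "col_c" and "label" are placeholders, not the poster's actual schema):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// hypothetical feature column names -- substitute the ones from your table
val assembler = new VectorAssembler()
  .setInputCols(Array("col_a", "col_b", "col_c"))
  .setOutputCol("features")

val assembled = assembler.transform(train)

// assumes "label" is already a double column; cast it first if it's an integer
val trainLp = assembled.select("label", "features").rdd.map { row =>
  LabeledPoint(row.getAs[Double]("label"), row.getAs[Vector]("features"))
}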

Usually, CSV files are raw, unfiltered sources of information; often they are the original source of the data.
Before you build a model, you usually want to go through a data cleansing, data preparation, data wrangling (and maybe more "data x") phase. This phase usually takes a big share of the model-building effort and requires exploring the data. Typically, a process of transformation and feature selection (and creation) sits between the original data and the data that actually trains the model.
If your CSV files don't need any of these preliminary phases - good for you!
You can always create configuration files that keep track of the specific columns or column indexes that feed your model.
If your DataFrame comes from a "select", I guess what you can do to improve legibility and maintainability is to use column names instead of index numbers.
df.select($"my_col_1", $"my_col_2", .. )
and then operate through
row.getAs[String]("my_col_1")
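Putting those two pieces together, here is a minimal sketch of building labeled points from named columns rather than positional indexes (the column names "label", "my_col_1", etc. are placeholders; casting everything to double in the select avoids the per-type asInstanceOf calls):

import org.apache.spark.sql.functions.col
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// keep the feature list in one place so schema changes only touch this line
val featureCols = Seq("my_col_1", "my_col_2", "my_col_3")

val trainLp = train
  .select((col("label") +: featureCols.map(c => col(c))).map(_.cast("double")): _*)
  .rdd
  .map { row =>
    // column 0 is the label, the rest are features, all already doubles
    LabeledPoint(row.getDouble(0),
      Vectors.dense(featureCols.indices.map(i => row.getDouble(i + 1)).toArray))
  }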

Databricks - Reduce delta version compute time

I've got a process which is really bogged down by the version computing for the target delta table.
A little bit of context - there are other things that run, all contributing uniformly structured dataframes that I want to persist in a delta table. Ultimately these are all compiled into lots_of_dataframes to be logged.
for i in lots_of_dataframes:
    i.write.insertInto("target_delta_table")
    # ... takes a while to compute the version
I've dug into the documentation but couldn't find any setting to skip the version compute. I did see vacuuming, but I'm not sure that'll do the trick since there will still be a lot of activity in a small window of time.
I know that I can union all of the dataframes together and just do the insert once, but I'm wondering if there is a more Databricks-ian way to do it. Like a configuration to only maintain 1 version at a time and not worry about computing for a restore.
Most probably (it's hard to say exactly without more details) the problem arises from the following facts:
Spark is lazy - the actual data processing doesn't happen until you perform an action, like writing data into a destination table. So if you have a lot of transformations, etc., they will all happen when you write the data.
You're writing data in a loop - you can potentially speed it up a bit by doing a union of all the dataframes into a single one that is written in one go:
import functools

# union everything into a single dataframe and pay the write/version-compute cost once
unioned = functools.reduce(lambda x, y: x.union(y), lots_of_dataframes)
unioned.write.insertInto("target_delta_table")

What are some of the most efficient workflows for processing "big data" (250+ GB) from postgreSQL databases?

I am constructing a script that will be processing well over 250 GB of data from a single PostgreSQL table. The table's shape is roughly 150 columns x 74M rows. My goal is to somehow sift through all the data and make sure that each cell entry meets certain criteria that I will be tasked with defining. After the data has been processed I want to pipeline it into an AWS instance. Here are some scenarios I will need to consider:
How can I ensure that each cell entry meets certain criteria of the column it resides in? For example, all entries in the 'Date' column should be in the format 'yyyy-mm-dd', etc.
What tools/languages are best for handling such large data? I use Python and the Pandas module often for DataFrame manipulation, and am aware of the read_sql function, but I think that this much data will simply take too long to process in Python.
I know how to manually process the data chunk by chunk in Python, but I think that this is probably too inefficient and the script could take well over 12 hours.
Simply put (TL;DR): I'm looking for a simple, streamlined way to manipulate and perform QC analysis on PostgreSQL data.

How do you generate a CAD geometry of randomly oriented objects?

How can one generate CAD geometries of randomly oriented and randomly sized objects (3D)? I need to model randomly sized and randomly oriented rectangles--thousands to millions of them.
I have not yet come across any CAD tools that have =rand() functions that can be inputted into dimensions. Is one way perhaps to have a CAD program import a CSV file of these randomly generated parameter values?
In SolidWorks, you can have model parameters (dimension lengths/angles, constraints, etc.) stored in an Excel spreadsheet called a Design Table. Each row in the spreadsheet represents a different configuration of your model, and each column a different parameter. You can use Excel's built-in capabilities, or an export-capable tool of your choosing, to generate the configurations according to your desired distribution. I don't recall off the top of my head the easiest way to get a large number of instances with different configurations into the same assembly, but you haven't really told us what you're trying to accomplish, so I can't give you specific recommendations anyway.
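As a rough illustration of the "generate the configurations externally" idea, here is a small sketch that writes a CSV of random rectangle parameters which could then be pasted into a Design Table. The parameter names like width@Sketch1 are hypothetical - the actual header row has to match the dimension names in your model - and the random distributions are just placeholders:

import java.io.PrintWriter
import scala.util.Random

val rng = new Random(42)
val n = 1000  // number of configurations to generate

val out = new PrintWriter("design_table.csv")
// first column is the configuration name, the rest are model dimensions
out.println("Configuration,width@Sketch1,height@Sketch1,angle@Sketch1")
for (i <- 1 to n) {
  val width  = 10 + rng.nextDouble() * 90   // e.g. 10..100 mm, uniform
  val height = 10 + rng.nextDouble() * 90
  val angle  = rng.nextDouble() * 360       // degrees
  out.println(f"rect_$i%05d,$width%.3f,$height%.3f,$angle%.3f")
}
out.close()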
If you have a specific CAD tool then you can often find documentation on the internal file format. With a little experimentation you can sometimes write a small external program that will generate the header of the CAD file and then loop thousands or millions of times generating each individual object. Finally you generate the lines needed to complete the file. That can sometimes be easier than trying to force a tool to do something the designers never expected. And this might let you use the software of your choice to generate the file.
I would suggest starting small. Use the CAD tool to create a file with two or three of your rectangles. Save and inspect the contents of the file to see that it matches your understanding of the needed format. Then try externally creating what should be the same file and verify your version is correctly accepted.
You might consider that some tool designers never expected someone to want thousands or millions of anything. I would suggest sneaking up on the problem. Try doubling the number of items, check this works as expected and then repeat this process again and again until either you successfully get to millions or until you find the CAD tool won't be able to handle this.

What are the differences between Tables and Categorical Arrays, and cell and struct arrays?

In the newest version of MATLAB there are two new data types: Tables and Categorical Arrays.
Table is a new data type suitable for holding data and metadata, and can be used with mixed-type tabular data that are often stored as columns in a text file or in a spreadsheet. It consists of rows and column-oriented variables.
Categorical arrays are useful for holding categorical data - which have values from a finite list of discrete categories.
In previous versions I would have handled these use cases using cell and struct arrays. What are the differences between these and the new data types?
I haven't upgraded yet so I can't play around, but based on this video and this article I can already see some advantages. They're not necessarily adding functionality that you couldn't achieve before; rather, they take the hassle out of it.
Using readtable over xlsread is immediately appealing to me. Being able to access columns by name rather than just by index is great; I do it in other languages often. In a table where column order doesn't really matter (unlike a matrix), it's really convenient to be able to address a column by its name instead of having to know the column order. You can also merge tables using the join function, which wasn't that easy to do with cell arrays before. I see that you can name the rows too; I didn't see what advantage that gives you and I can't play around, but I know that in some languages (like pandas in Python, and I think in R as well) naming rows means you can work with time series that are not completely overlapping and not have to worry about alignment. I hope this is the case in MATLAB too!
Categorical arrays also look like just an extra layer of convenience, kind of like an enum. You never actually need an enum, but it makes development more pleasant.
Anyway that's just my two cents, I probably won't get an opportunity to play around with them any time soon but I look forward to using them when I do need them.
I use the table format to organize different input/output cases in my data, where the results may come from different tables. The main advantages compared to struct or cell arrays:
convenient table functions such as join, innerjoin, and outerjoin
access by field name, which is more robust programming than indexing into arrays
a data format that is easy to export/import (e.g. as a delimited .txt file), with no hand-written fprintf() calls
a data file that can be opened in Excel/Calc (LibreOffice), unlike a .mat file

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with Hadoop/NoSQL, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should come back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
- 1000s of datasets
- dataset keys:
  - all datasets have the same keys
  - 1 million keys (this may later be 10 or 20 million)
- dataset columns:
  - each dataset has the same columns
  - 10 to 20 columns
  - most columns are numerical values that we need to aggregate on (avg, stddev, and use R to calculate statistics)
  - a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
- web application:
  - users can choose which datasets they are interested in (anywhere from 15 to 1000)
  - the application needs to present: the key, and the aggregated results (avg, stddev) of each column
- updates of data:
  - an entire dataset can be added, dropped, or replaced/updated
  - it would be cool to be able to add columns; but, if required, we can just replace the entire dataset
  - rows/keys are never added to a dataset - so we don't need a system with lots of fast writes
- infrastructure:
  - currently two machines with 24 cores each
  - eventually, we want the ability to also run this on Amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
- partitions are nice, since I can easily add/drop/replace partitions
- the database is nice for filtering based on type_id
- databases aren't easy for writing parallel queries
- databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
- created a tab-separated file per dataset for a particular type_id
- uploaded to HDFS
- map: retrieved a value/column for each key
- reduce: computed average and standard deviation
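For illustration only, here is roughly what those map/reduce steps could look like if written as a Spark job in Scala (in spark-shell, where sc is available) instead of plain Hadoop MapReduce. The file path and column positions are made up; the aggregation keeps count, sum, and sum of squares per key so the mean and stddev fall out at the end:

val lines = sc.textFile("hdfs:///datasets/dataset_42.tsv")  // hypothetical path

// map: (key, value) pairs from tab-separated lines; column 3 is a made-up value column
val keyed = lines.map { line =>
  val fields = line.split("\t")
  (fields(0), fields(3).toDouble)
}

// reduce: accumulate (count, sum, sum of squares) per key, then derive avg and stddev
val stats = keyed
  .aggregateByKey((0L, 0.0, 0.0))(
    (acc, x) => (acc._1 + 1, acc._2 + x, acc._3 + x * x),
    (a, b)   => (a._1 + b._1, a._2 + b._2, a._3 + b._3)
  )
  .mapValues { case (n, s, ss) =>
    val mean = s / n
    (mean, math.sqrt(ss / n - mean * mean))   // (avg, population stddev)
  }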
From my crude proof of concept, I can see this will scale nicely, but I can also see that Hadoop/HDFS has latency; I've read that it's generally not used for real-time querying (even though I'm OK with returning results back to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot come back within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2]- (E[X])^2)
this implies that you can get the stddev of the combined dataset AB by computing
sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and, likewise, E[AB] = (sum(A) + sum(B)) / (|A| + |B|)
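A minimal sketch of that idea in Scala (names are illustrative, nothing library-specific): keep (count, sum, sum of squares) per dataset, merge the triples for whichever datasets the user selected, and derive the mean and (population) stddev at the end.

final case class Stats(n: Long, sum: Double, sumSq: Double) {
  // combining two datasets only needs their sufficient statistics, not the raw values
  def merge(o: Stats): Stats = Stats(n + o.n, sum + o.sum, sumSq + o.sumSq)
  def mean: Double   = sum / n
  def stddev: Double = math.sqrt(sumSq / n - mean * mean)
}

object Stats {
  def of(xs: Iterable[Double]): Stats =
    Stats(xs.size, xs.sum, xs.map(x => x * x).sum)
}

// usage: pre-compute Stats per dataset, then merge only the selected ones at query time
val a  = Stats.of(Seq(1.0, 2.0, 3.0))
val b  = Stats.of(Seq(4.0, 5.0))
val ab = a.merge(b)
println(f"mean=${ab.mean}%.3f stddev=${ab.stddev}%.3f")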
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
This is a serious problem without an immediate good solution in the open-source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open-source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory - that should give you an idea of what size of cluster you need.
If I understand you correctly, and you only need to aggregate on single columns at a time, you can store your data differently for better results. In HBase that would look something like:
- a table per data column in today's setup, plus another single table for the filtering fields (type_ids)
- a row for each key in today's setup - you may want to think about how to incorporate your filter fields into the key for efficient filtering, otherwise you'd have to do a two-phase read
- a column for each table in today's setup (i.e. a few thousand columns)
HBase doesn't mind if you add new columns, and it is sparse in the sense that it doesn't store data for columns that don't exist. When you read a row you'd get all the relevant values, from which you can compute avg. etc. quite easily.
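To make the "read one row, aggregate in the client" step concrete, here is a hedged sketch using the HBase Java client from Scala. The table name "col_X", column family "d", row key, and the assumption that values were written as 8-byte doubles (via Bytes.toBytes(double)) are all made up for illustration:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("col_X"))   // one table per original data column

// one row per key; each qualifier under family "d" holds that key's value in one dataset
val result = table.get(new Get(Bytes.toBytes("some_key")))
val values = result.getFamilyMap(Bytes.toBytes("d")).asScala.values.map(v => Bytes.toDouble(v))

val avg = values.sum / values.size
table.close()
conn.close()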
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since your data set doesn't sound like it needs joins, you should be fine. You can set up indexes to find the datasets, and then do the math either in SQL or in your app.