Stata: combining datasets in memory with append

I'm using census data across multiple years. For each year, the dataset is laid out completely the same in terms of structure and content (i.e. the same questions are asked of respondents and their answers are laid out in the same way across years). I want to import multiple datasets, apply changes to each, and then combine them.
To illustrate what I mean, I imported the 2018 data (into a frame titled 'cps18') and removed three specific rows. I then used frame change, imported the 2019 data into a frame called 'cps19', and applied similar changes. When I use append using cps18, the console returns "file cps18 not found."
From what I can find online, it seems that the append command combines a dataset on disk with the dataset in memory. But what if I have two or more frames in memory? Is there a way to combine them?
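One possible workaround, sketched below and assuming Stata 16 or later (where frames and the frame prefix are available): write one frame out to a tempfile and append that file from the other frame. The frame and macro names mirror the ones above.

* sketch: append the cps18 frame to the cps19 frame via a tempfile on disk
tempfile cps18file
frame cps18: save `cps18file'    // write the 2018 frame to a temporary .dta file
frame change cps19               // make the 2019 frame the current frame
append using `cps18file'         // append the saved 2018 data to cps19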

Related

Spark: convert DataFrame to SVM labeled point

When referring to the Spark ml/mllib docs, they all start from an SVM-format example stored on disk. This is really frustrating, since there doesn't seem to be a straightforward way to go from a standard RDD[Row] or DataFrame (taken from a "table" select) to that representation without first storing it.
This is just an inconvenience when dealing with 3 features or so, but when you scale up to lots and lots of features, it implies a lot of typing and index hunting.
I ended up with something like this (where "train" is a random split of a dataset with features stored in a table):
val trainLp = train.map(row => LabeledPoint(
  row.getInt(0).toDouble,
  Vectors.dense(
    row(8).asInstanceOf[Int].toDouble,  row(9).asInstanceOf[Int].toDouble,
    row(10).asInstanceOf[Int].toDouble, row(11).asInstanceOf[Int].toDouble,
    row(12).asInstanceOf[Int].toDouble, row(13).asInstanceOf[Int].toDouble,
    row(14).asInstanceOf[Int].toDouble, row(15).asInstanceOf[Int].toDouble,
    row(18).asInstanceOf[Int].toDouble, row(21).asInstanceOf[Int].toDouble,
    row(27).asInstanceOf[Int].toDouble, row(28).asInstanceOf[Int].toDouble,
    row(29).asInstanceOf[Int].toDouble, row(30).asInstanceOf[Int].toDouble,
    row(31).asInstanceOf[Double], row(32).asInstanceOf[Double],
    row(33).asInstanceOf[Double], row(34).asInstanceOf[Double],
    row(35).asInstanceOf[Double], row(36).asInstanceOf[Double],
    row(37).asInstanceOf[Double], row(38).asInstanceOf[Double],
    row(39).asInstanceOf[Double], row(40).asInstanceOf[Double],
    row(41).asInstanceOf[Double], row(42).asInstanceOf[Double],
    row(43).asInstanceOf[Double])))
This is a nightmare to maintain, since these rows tend to change pretty often.
And this only gets me to labeled points; I'm not even at an SVM-format version of the data yet.
What am I missing here that could potentially save me days of misery?
EDIT:
I got one step closer to a solution using something called VectorAssembler to build up my feature vector.
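For reference, a minimal sketch of that approach (assuming "train" is a DataFrame; the column names are placeholders):

import org.apache.spark.ml.feature.VectorAssembler

// assemble named feature columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("col_a", "col_b", "col_c"))  // placeholder column names
  .setOutputCol("features")

val assembled = assembler.transform(train)         // adds a "features" vector column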
CSV files are usually raw, unfiltered sources of information; often they are the original source.
In order to build a model, you usually go through a data cleansing, data preparation, data wrangling (and maybe more "data x" wording) phase before you build your model. This phase usually takes a big share of the model-building effort and usually requires exploring the data. Typically, a process of transformation and feature selection (and creation) happens between the original data and the data that actually builds the model.
If your CSV files don't need any of these preliminary phases - good for you!
You can always maintain configuration files that keep track of the columns or column indexes that feed your model.
If your DataFrame comes from a "select", I guess what you can do to improve legibility and maintainability is to use column names instead of index numbers.
df.select($"my_col_1", $"my_col_2", .. )
and then read values with
row.getAs[String]("my_col_1")
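Putting those two pieces together, a rough sketch of building labeled points from named columns instead of positional indexes (the label and column names are placeholders):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import spark.implicits._                           // assumes a SparkSession named spark

val trainLp = df.select($"label", $"my_col_1", $"my_col_2").rdd.map { row =>
  LabeledPoint(
    row.getAs[Int]("label").toDouble,
    Vectors.dense(
      row.getAs[Int]("my_col_1").toDouble,         // named access survives column reordering
      row.getAs[Double]("my_col_2")))
}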

How do you generate a CAD geometry of randomly oriented objects?

How can one generate CAD geometries of randomly oriented and randomly sized objects (3D)? I need to model randomly sized and randomly oriented rectangles--thousands to millions of them.
I have not yet come across any CAD tools with =rand() functions that can be entered into dimensions. Is one way perhaps to have a CAD program import a CSV file of randomly generated parameter values?
In SolidWorks, you can have model parameters (dimension lengths/angles, constraints, etc.) stored in an Excel spreadsheet called a Design Table. Each row in the spreadsheet represents a different configuration of your model, and each column a different parameter. You can use Excel's built-in capabilities or an export-capable tool of your choosing to generate the configurations according to your desired distribution. I don't recall off the top of my head the easiest way to get a large number of instances with different configurations into the same assembly, but you haven't really told us what you're trying to accomplish, so I can't give you specific recommendations anyway.
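If you go the spreadsheet/CSV route, here is a rough sketch of generating the configurations externally. It is in Python; the column names are placeholders, not actual SolidWorks parameter names, and the real header row would need to follow the naming conventions SolidWorks documents for Design Tables.

import csv
import random

# write one row of random size/orientation parameters per configuration
# (column names are illustrative, not actual SolidWorks parameter names)
with open("configurations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["config", "width_mm", "height_mm", "depth_mm",
                     "rot_x_deg", "rot_y_deg", "rot_z_deg"])
    for i in range(10000):
        writer.writerow([
            f"rect_{i:05d}",
            round(random.uniform(1.0, 50.0), 3),    # random sizes
            round(random.uniform(1.0, 50.0), 3),
            round(random.uniform(1.0, 50.0), 3),
            round(random.uniform(0.0, 360.0), 2),   # random orientation angles
            round(random.uniform(0.0, 360.0), 2),
            round(random.uniform(0.0, 360.0), 2),
        ])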
If you have a specific CAD tool then you can often find documentation on the internal file format. With a little experimentation you can sometimes write a small external program that will generate the header of the CAD file and then loop thousands or millions of times generating each individual object. Finally you generate the lines needed to complete the file. That can sometimes be easier than trying to force a tool to do something the designers never expected. And this might let you use the software of your choice to generate the file.
I would suggest starting small. Use the CAD tool to create a file with two or three of your rectangles. Save and inspect the contents of the file to see that it matches your understanding of the needed format. Then try externally creating what should be the same file and verify your version is correctly accepted.
You might consider that some tool designers never expected someone to want thousands or millions of anything. I would suggest sneaking up on the problem. Try doubling the number of items, check this works as expected and then repeat this process again and again until either you successfully get to millions or until you find the CAD tool won't be able to handle this.

Neo4j Import tool - Console output meaning

What does the console output of the Neo4j import tool mean?
Example lines:
[INPUT--------------PROPERTIES(2)======|WRITER: W:71.] 3M
[INPUT------|PREPARE(|RELATIO||] 49M
[Relationship --> Relationship + counts-------]282M
When I try to import a large dataset through this tool, the import seems to hang at 248M in the 'calculate dense nodes' step. What exactly does 'calculating dense nodes' do?
The import stages are:
Nodes
Prepare node index
Calculate dense nodes
Node --> Relationship Sparse
Relationship --> Relationship Sparse
Node counts
Relationship counts
As for interpreting the statistics, I believe #mattias-persson wrote about it in the Neo4j manual. Copying it here, for the record:
10.1.2.5. Output and statistics
While an import is running through its different stages, some statistics and figures are printed in the console. The general interpretation of that output is to look at the horizontal line, which is divided up into sections, each section representing one type of work going on in parallel with the other sections. The wider a section is, the more time is spent there relative to the other sections, the widest being the bottleneck, also marked with *. If a section has a double line, instead of just a single line, it means that multiple threads are executing the work in that section. To the far right a number is displayed telling how many entities (nodes or relationships) have been processed by that stage.
As an example:
[*>:20,25 MB/s-----------|PREPARE(3)==========|RELATIONSHIP(2)===========] 16M
Would be interpreted as:
> data being read, and perhaps parsed, at 20,25 MB/s, data that is being passed on to …​
PREPARE preparing the data for …​
RELATIONSHIP creating actual relationship records and …​
v writing the relationships to the store. This step is not visible in this example, because it is so cheap compared to the other sections.
Observing the section sizes can give hints about where performance can be improved. In the example above, the bottleneck is the data read section (marked with >), which might indicate that the disk is being slow, or is poorly handling simultaneous read and write operations (since the last section often revolves around writing to disk).
After some offline discussion, it seems that most, if not all, "missing" nodes were due to one line in the CSV file having a property value that started with a quotation mark (") but did not contain the closing quote. This resulted in the parser reading until the next quote, i.e. across new-lines, thinking that it was still reading that same property value for that node.
It would be great to have some sort of detection for such missing quotes, but that's not straightforward, given that it might interfere with nodes/relationships whose values actually do span multiple lines.
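A constructed illustration of the failure mode (not the actual data):

id,name,comment
1,Alice,"a complete value"
2,Bob,"a value missing its closing quote
3,Carol,"another value"

From the opening quote on Bob's line, the parser keeps reading across the newline until it hits the next quote character, so Carol's line is swallowed into Bob's property value and Carol's node appears to be missing.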

One big, wide table or many smaller ones for statistics data

I'm writing the simplest analytics system for my company. I have about 100 different event types that should be collected across tens of projects. We are not interested in cross-project analytics, but event types are similar across all projects. I use PostgreSQL as the primary storage for this system. Now I have to decide which architecture is preferable.
The first architecture is one very big table (in terms of row count) per project that contains data for all event types. It would have 20 or more columns, many of them nullable. Partitioning might be used to split this table by event type, but the table would still be just as wide.
The second architecture is a lot of tables (fairly big in terms of row count but not as wide), with one table per event type.
I am going to retrieve analytics data from these tables using various join queries (self joins in the case of the first architecture). Which one is preferable, and what are the pitfalls of each?
UPD: All events have about 10 common attributes; the remaining attributes vary from one event type to another.
In the past, I've had similar situations. With postgres you have a bunch of options.
Depending on how your data is input into the system (all at once or a little at a time), the volume of data per project (hundreds of data points vs. millions), and the querying pattern (i.e. querying after the data is all in, querying nightly, or reports running constantly throughout), there are many options. One other factor is whether new project types (with new data point types) are likely to crop up.
First, for your "first architecture", the question that comes up for me is: are all the "data points" the same data type (or at least very similar)? Are some text and others numeric? Are some integers and others floats? If so, you're likely to run into issues rolling up your data without building either a column or a table for every data type.
If all your data is the same datatype, then the first architecture you mentioned might work really well.
The second architecture you mentioned is OK, especially if you don't expect a bunch of new project types coming down the pike anytime soon; otherwise, you'll be constantly modifying the DB, which I prefer to avoid when unnecessary.
A third architecture that you didn't mention is to have a combination of 1 and 2. Basically have 1 table to hold the 10 common attributes and use either 1 or 2 to hold the additional attributes. This would have an advantage, especially if the additional data wasn't that frequently used, or was non-numeric.
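A rough sketch of that combined layout (the table and column names are illustrative only):

-- one table for the ~10 shared attributes
CREATE TABLE events (
    event_id    bigserial PRIMARY KEY,
    project_id  integer     NOT NULL,
    event_type  text        NOT NULL,
    occurred_at timestamptz NOT NULL
    -- ... remaining common attributes
);

-- one narrow table per event type for the type-specific attributes
CREATE TABLE event_page_view (
    event_id bigint PRIMARY KEY REFERENCES events (event_id),
    url      text,
    referrer text
);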
Lastly, you could use one of PostgreSQL's "document store" type datatypes. You could store this information in arrays, hstore, or json. Now, this will be fairly inefficient if you're doing a ton of aggregate functions, as you might be left calculating the aggregates outside of PostgreSQL, or at a minimum running an inefficient query. You could store the 10 common fields as normal columns and the additional ones as hstore or json.
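And a rough sketch of that document-store variant, with the type-specific attributes kept in a jsonb column (names are illustrative):

CREATE TABLE events (
    event_id    bigserial PRIMARY KEY,
    project_id  integer     NOT NULL,
    event_type  text        NOT NULL,
    occurred_at timestamptz NOT NULL,
    attributes  jsonb                     -- type-specific attributes
);

-- querying a type-specific attribute
SELECT count(*)
FROM events
WHERE event_type = 'page_view'
  AND attributes ->> 'referrer' LIKE '%example.com%';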
I didn't ask, but it would be nice to know whether each event within a project has more than 1 data point (i.e. are you logging changes, or just updating data?). If your overall table has fewer than 100,000 rows, it's likely best to focus on what's easier to maintain and program rather than on performance, as small amounts of data are pretty quick regardless of how they're stored.

Efficient disk access of large number of small .mat files containing objects

I'm trying to determine the best way to store large numbers of small .mat files, around 9000 objects with sizes ranging from 2k to 100k, for a total of around half a gig.
The typical use case is that I only need to pull a small number (say 10) of the files from disk at a time.
What I've tried:
Method 1: If I save each file individually, I get performance problems (very slow save times and system sluggishness for some time after), as Windows 7 has difficulty handling so many files in a folder (and I think my SSD is having a rough time of it, too). However, the end result is fine; I can load what I need very quickly. This is using '-v6' save.
Method 2: If I save all of the files in one .mat file and then load just the variables I need, access is very slow (loading takes around three quarters of the time it takes to load the whole file, with small variation depending on the ordering of the save). This is using '-v6' save, too.
I know I could split the files up into many folders, but it seems like such a nasty hack (and won't fix the SSD's dislike of writing many small files). Is there a better way?
Edit:
The objects consist mainly of a numeric matrix of double data and an accompanying vector of uint32 identifiers, plus a bunch of small identifying properties (char and numeric).
Five ideas to consider:
Try storing in an HDF5 object - take a look at http://www.mathworks.com/help/techdoc/ref/hdf5.html - you may find that this solves all of your problems. It will also be compatible with many other systems (e.g. Python, Java, R). A small sketch of MATLAB's high-level HDF5 interface follows this list.
A variation on your method #2 is to store them in one or more files, but to turn off compression.
Different datatypes: It may also be the case that you have some objects that compress or decompress inexplicably poorly. I have had such issues with either cell arrays or struct arrays. I eventually found a way around it, but it's been a while and I can't remember how to reproduce this particular problem. The solution was to use a different data structure.
#SB proposed a database. If all else fails, try that. I don't like building external dependencies and additional interfaces, but it should work (the primary problem is that if the DB starts to groan or corrupts your data, then you're back at square 1). For this purpose consider SQLite, which doesn't require a separate server/client framework. There is an interface available on Matlab Central: http://www.mathworks.com/matlabcentral/linkexchange/links/1549-matlab-sqlite
(New) Considering that the objects are less than 1GB, it may be easier to just copy the entire set to a RAM disk and then access through that. Just remember to copy from the RAM disk if anything is saved (or wrap save to save objects in two places).
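As mentioned in idea 1 above, a minimal sketch of MATLAB's high-level HDF5 interface (the file and dataset names are illustrative, and the matrix is a stand-in for the object's data):

% write one object's numeric matrix to a dataset in a single HDF5 file
data = rand(1000, 20);
h5create('objects.h5', '/obj0001/matrix', size(data));
h5write('objects.h5', '/obj0001/matrix', data);

% later, read back only the dataset you need
m = h5read('objects.h5', '/obj0001/matrix');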
Update: The OP has mentioned custom objects. There are two methods to consider for serializing these:
Two serialization programs from Matlab Central: http://www.mathworks.com/matlabcentral/fileexchange/29457 - which was inspired by: http://www.mathworks.com/matlabcentral/fileexchange/12063-serialize
Google's Protocol Buffers. Take a look here: http://code.google.com/p/protobuf-matlab/
Try storing them as blobs in a database.
I would also try the multiple folders method as well - it might perform better than you think. It might also help with organization of the files if that's something you need.
The solution I have come up with is to save object arrays of around 100 of the objects each. These files tend to be 5-6 meg so loading is not prohibitive and access is just a matter of loading the right array(s) and then subsetting them to the desired entry(ies). This compromise avoids writing too many small files, still allows for fast access of single objects and avoids any extra database or serialization overhead.
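A rough sketch of that chunking approach (variable and file names are illustrative):

% save object arrays of ~100 objects per .mat file
chunkSize = 100;
nChunks = ceil(numel(allObjects) / chunkSize);
for k = 1:nChunks
    idx = (k-1)*chunkSize + 1 : min(k*chunkSize, numel(allObjects));
    chunk = allObjects(idx);                      % object array of up to 100 objects
    save(sprintf('chunk_%03d.mat', k), 'chunk');
end

% load only the chunk that contains the object you need
g = 642;                                          % global index of the wanted object
k = ceil(g / chunkSize);                          % which chunk file it lives in
s = load(sprintf('chunk_%03d.mat', k), 'chunk');
obj = s.chunk(g - (k-1)*chunkSize);               % position within that chunk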