Count by value distributed across multiple columns in Google Cloud Dataprep - google-cloud-dataprep

I have a somewhat complex data transformation task that I cannot figure out in Google Cloud Dataprep. The source data is voter file information. The CSV has 10 columns (among many others) that contain a voter's election participation history. See screenshot. In short, the most recent election you voted in is included as text_election_code_1, the second most recent is in text_election_code_2, and so on. The value of the cell is the code of the election itself, e.g. GN2016 = 2016 general election.
Ideally I would like to transform this into a lookup matrix to answer questions like "Did voter with id# vote in GN2016?" and "How many people total voted in GN2012?"
As the data stands now, it is extremely tough to count by election code because "GN2012" might be in any one of 10 columns. For example, in the screenshot below GN2012 is in column 3 for the first 2 rows and column 2 for the 3rd row.
I've done this before with SQL, but I cannot figure out how to do this within Cloud Dataprep. Can anyone steer me in the right direction?
Current data shape (other P.I.I. columns omitted from screenshot)
Ideal data shape (maybe)

I decided against a "wide" table in favor of a "long" one. It was pretty easy to accomplish after all with the "unpivot" option, which converted the column values into rows. This example was very helpful: https://cloud.google.com/dataprep/docs/html/Analyze-across-Multiple-Columns_57344575
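For readers without Dataprep at hand, the same wide-to-long reshaping can be sketched in plain Python. This is only an illustration of the idea, not the Dataprep transform itself: the voter IDs and most election codes below are made-up sample data, the function name is hypothetical, and only three of the ten history columns are shown.

```python
from collections import Counter

# Hypothetical sample rows; column names follow the text_election_code_N pattern
# from the question, everything else is invented for illustration.
voters = [
    {"voter_id": 1, "text_election_code_1": "GN2016",
     "text_election_code_2": "PR2016", "text_election_code_3": "GN2012"},
    {"voter_id": 2, "text_election_code_1": "GN2016",
     "text_election_code_2": "GN2014", "text_election_code_3": "GN2012"},
    {"voter_id": 3, "text_election_code_1": "GN2014",
     "text_election_code_2": "GN2012", "text_election_code_3": ""},
]


def unpivot(rows, id_col="voter_id", prefix="text_election_code_"):
    """Turn one wide row per voter into one (voter_id, election_code) pair per vote."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            # Keep only the history columns and skip empty cells.
            if col.startswith(prefix) and value:
                long_rows.append((row[id_col], value))
    return long_rows


long_rows = unpivot(voters)

# "How many people total voted in GN2012?"
votes_per_election = Counter(code for _, code in long_rows)
print(votes_per_election["GN2012"])

# "Did voter with id# vote in GN2016?"
voted = set(long_rows)
print((1, "GN2016") in voted, (3, "GN2016") in voted)
```

Once the data is long, both questions from the post become a simple count or membership test, regardless of which of the original columns held the code.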

Related

What is the best way to store data where one column has values that repeat ranging anywhere from 1-300+ times?

I've used web scraping to grab approximately 10,000 movies and all their associated review pages URLs, and the next step for me is to grab every single one of those reviews so that I can get the overall positive/negative reviews using sentiment analysis.
I'm writing all this in Python and am using the Pandas library to pre-process and structure the data. I already have around 36,000 rows containing the name of the movie in one column and the URL in the other, with the movie name repeated over and over again. With an average of 20 reviews per page, I'm looking at roughly 720,000 rows when all is said and done.
This is for the final project of the college course I'm taking, and throughout my schooling I've come to fear data redundancy in databases. I will eventually be writing all of this to a PostgreSQL database so users can query any movie to get back the prediction, and I'm having a hard time overlooking the fact that these movie titles are being repeated so often.
I was wondering if there was a better way to go about this (which could also hopefully save me some processing time), any help would be greatly appreciated!
I feel like this is more of a direct question than a code issue, but if necessary I can provide any relevant code.
If the movie name is all the information you have about each movie, there is no redundancy (in the relational sense), since the name is the unique identifier.
You could save some space by having a separate movie table that contains an artificial numeric ID and the name and reference the ID from the main table, but that will make your queries more complicated and seems unnecessary for a small table like this.
What I would be more concerned about is whether the movie name is a good identifier at all: what if two movies have the same name? In this age of remakes, that is not a rarity.
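A minimal sketch of the separate-table layout mentioned above, using Python's built-in sqlite3 as a stand-in for PostgreSQL. The table names, the year column, and the sample titles and URLs are all hypothetical; the year is just one possible way to tell apart two movies that share a name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per movie; the artificial numeric ID is the key.
    CREATE TABLE movie (
        movie_id INTEGER PRIMARY KEY,
        title    TEXT NOT NULL,
        year     INTEGER NOT NULL,
        UNIQUE (title, year)
    );
    -- One row per review page; refers to the movie by ID, not by name.
    CREATE TABLE review (
        review_id INTEGER PRIMARY KEY,
        movie_id  INTEGER NOT NULL REFERENCES movie(movie_id),
        url       TEXT NOT NULL
    );
""")

# Two hypothetical movies with the same title (an original and a remake).
conn.executemany(
    "INSERT INTO movie (title, year) VALUES (?, ?)",
    [("Example Movie", 1990), ("Example Movie", 2020)],
)
conn.executemany(
    "INSERT INTO review (movie_id, url) VALUES (?, ?)",
    [(1, "https://example.com/r/1"),
     (1, "https://example.com/r/2"),
     (2, "https://example.com/r/3")],
)

# Getting the title back now needs a join, which is the extra query
# complexity the answer refers to.
rows = conn.execute("""
    SELECT m.title, m.year, COUNT(*)
    FROM review AS r
    JOIN movie AS m ON m.movie_id = r.movie_id
    GROUP BY m.movie_id
    ORDER BY m.year
""").fetchall()
print(rows)
```

With the name alone as the key, the two movies above would have been merged into one set of reviews.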

How to append separate datasets to make a combined stacked dataset in Stata without losing information

I'm trying to merge two datasets from two time periods, time 1 & 2, to make a combined repeated measures dataset. There are some observations in time 1 which do not appear in time 2, as the observations are for participants who dropped out after time 1.
When I use the append command in Stata, it appears to drop the observations from time 1 that don't have corresponding data at time 2. It does, however, append observations for new participants who joined at time 2.
I would like to keep the time 1 data of those participants who dropped out, so that I can still use that information in the combined dataset.
How can I tell Stata not to automatically drop these participants?
Thanks,
Steve
Perhaps the best way of interesting people in advising you on your problems is to respect those who answer your questions. You have been repeatedly advised, even as recently as yesterday, to review https://stackoverflow.com/help/someone-answers and provide the feedback that is reflected in the reputation scores of those who take the time to help you.
In any event, append does not work as you describe it. If you take the time to work out a small reproducible example, by creating two small datasets, appending them, and then listing the results, you may find the roots of your misunderstanding. Or you will at least be able to provide substantive information from which others can work, should someone be interested in helping you.

Is OLAP the right approach

I have a requirement to develop a reporting solution for a system that has a large number of data items, a significant number of which are free text fields. Almost any value in the tables needs to be accessible to a team of analysts who carry out reporting, analysis and data provision.
It has been suggested that an OLAP solution would be appropriate for delivering this. However, the general need is to get records, not aggregates, and each cube would have a large number of dimensions (~150) and very few measures (number of records, length of time). I have been told that this approach will let us answer any question we ask of it, but we do not have many repeated business questions; mostly we need to list the raw records.
Is OLAP really a logical way to go with this, or will the cubes take too long to process and limit the level of access to the data that the users require?

One big and wide table or many not so big for statistics data

I'm writing a simple analytics system for my company. I have about 100 different event types that should be collected for each of tens of projects. We are not interested in cross-project analytic queries, but the event types are similar across all projects. I use PostgreSQL as the primary storage for this system. Now I have to decide which architecture is preferable.
The first architecture is one very big table (in terms of row count) per project that contains data for all event types. It will have about 20 or more columns, many of them nullable. Partitioning may be used to split this table by event type, but the table will still be wide.
The second architecture is many tables (fairly big in terms of row count, but not so wide), with one table per event type.
I am going to retrieve analytic data from these tables using various join queries (self joins in the case of the first architecture). Which one is preferable, and what are the pitfalls of each?
UPD. All events have about 10 common attributes. The remaining attributes vary from one event type to another.
In the past, I've had similar situations. With postgres you have a bunch of options.
Depending on how your data is input into the system (all at once or a little at a time), the volume of your data per project (hundreds of data points vs. millions of data points) and the querying pattern (i.e., querying after the data is all in, querying nightly, or reports running constantly throughout), there are many options. One other factor will be whether new project types (with new data point types) are likely to crop up.
First, in your "first architecture" the first question that comes up for me is: are all the "data points" the same data type (or at least very similar)? Are some text and others numeric? Are some integers and others floats? If so, you're likely to run into issues with rolling up your data without building either a column or a table for every data type.
If all your data is the same datatype, then the first architecture you mentioned might work really well.
The second architecture you mentioned is OK especially if you don't predict having a bunch of new project types coming down the pike anytime soon, otherwise, you'll be constantly modifying the DB, which I prefer to avoid when unnecessary.
A third architecture that you didn't mention is to have a combination of 1 and 2. Basically have 1 table to hold the 10 common attributes and use either 1 or 2 to hold the additional attributes. This would have an advantage, especially if the additional data wasn't that frequently used, or was non-numeric.
Lastly, you could use one of PostgreSQL's "document store" type datatypes. You could store this information in arrays, hstores, or json. Now, this will be fairly inefficient if you're doing a ton of aggregate functions, as you might be left calculating the aggregates outside of Pgsql or, at a minimum, running an inefficient query. You could store the 10 common fields in normal fields, and the additional ones as hstore or json.
I didn't ask, but it'd be nice to know whether each event within a project has more than one data point (i.e., are you logging changes, or just updating data). If your overall table has fewer than 100,000 rows, it's likely best to focus on what's easier to maintain and program rather than on performance, as small amounts of data are pretty quick regardless of how they're stored.

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
- 1000s of datasets
- dataset keys:
  - all datasets have the same keys
  - 1 million keys (this may later be 10 or 20 million)
- dataset columns:
  - each dataset has the same columns
  - 10 to 20 columns
  - most columns are numerical values that we need to aggregate on (avg, stddev, and use R to calculate statistics)
  - a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
- web application:
  - user can choose which datasets they are interested in (anywhere from 15 to 1000)
  - application needs to present: key, and aggregated results (avg, stddev) of each column
- updates of data:
  - an entire dataset can be added, dropped, or replaced/updated
  - would be cool to be able to add columns. But, if required, can just replace the entire dataset.
  - never add rows/keys to a dataset - so don't need a system with lots of fast writes
- infrastructure:
  - currently two machines with 24 cores each
  - eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
- partitions are nice, since can easily add/drop/replace partitions
- database is nice for filtering based on type_id
- databases aren't easy for writing parallel queries
- databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
- created a tab separated file per dataset for a particular type_id
- uploaded to hdfs
- map: retrieved a value/column for each key
- reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can also see that hadoop/hdfs has latency, and I've read that it's generally not used for real-time querying (even though I'm OK with returning results back to users in 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially, each of them compiles down to one or more map/reduce jobs, so the response cannot come back within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
Check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of AB (the datasets A and B combined) by computing
sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] is (sum(A^2) + sum(B^2))/(|A|+|B|) and E[AB] is (sum(A) + sum(B))/(|A|+|B|).
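The identity above means each dataset only has to contribute three numbers (count, sum, sum of squares), and those can be added together for whichever datasets a user selects. A minimal sketch, assuming population (not sample) standard deviation; the class and method names are hypothetical and the sample values are made up:

```python
import math
from dataclasses import dataclass


@dataclass
class Summary:
    """Pre-computable per-dataset summary: count, sum and sum of squares."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    @classmethod
    def of(cls, values):
        values = list(values)
        return cls(len(values), sum(values), sum(v * v for v in values))

    def merge(self, other):
        # Combining two datasets is just adding the three running totals.
        return Summary(self.n + other.n,
                       self.total + other.total,
                       self.total_sq + other.total_sq)

    def mean(self):
        return self.total / self.n

    def stddev(self):
        # Population stddev: sqrt(E[X^2] - (E[X])^2).
        variance = self.total_sq / self.n - self.mean() ** 2
        # Guard against tiny negative values caused by floating-point rounding.
        return math.sqrt(max(variance, 0.0))


a = Summary.of([2, 4, 4, 4])   # hypothetical values for one key in dataset A
b = Summary.of([5, 5, 7, 9])   # the same key in dataset B
combined = a.merge(b)
print(combined.mean(), combined.stddev())
```

One caveat: subtracting two large, nearly equal numbers can lose precision when the values are large relative to their spread, so a numerically more careful variant may be preferable for real data.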
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open source space. In the commercial space, MPP databases like greenplum/netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory - that should give you an idea of what size of cluster you need.
If I understand you correctly, you only need to aggregate on a single column at a time, so you can store your data differently for better results.
In HBase that would look something like this:
- a table per data column in today's setup, and another single table for the filtering fields (type_ids)
- a row for each key in today's setup - you may want to think about how to incorporate your filter fields into the key for efficient filtering, otherwise you'd have to do a two-phase read
- a column for each table in today's setup (i.e. a few thousand columns)
HBase doesn't mind if you add new columns, and it is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you get all the relevant values, on which you can compute avg. etc. quite easily.
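To make the layout concrete, here is a rough sketch that models one such table with plain Python dictionaries rather than the HBase API. The column name "price", the dataset names and all values are hypothetical.

```python
import statistics

# One "table" per data column of the original setup (here a hypothetical
# "price" column). Each row is a key; each cell is that key's value in one
# dataset. Missing cells are simply absent, mirroring the sparse storage.
price_table = {
    "key1": {"ds_a": 1.0, "ds_b": 3.0, "ds_c": 5.0},
    "key2": {"ds_a": 2.0, "ds_c": 4.0},  # no value for ds_b
}


def aggregate_row(row, selected_datasets):
    """Return (mean, population stddev) over the selected datasets for one key."""
    values = [row[name] for name in selected_datasets if name in row]
    if not values:
        return None
    return statistics.mean(values), statistics.pstdev(values)


# A single row read gives everything needed for one key.
selected = ["ds_a", "ds_b"]
results = {key: aggregate_row(row, selected) for key, row in price_table.items()}
print(results)
```

Because each key's aggregation touches only its own row, the work splits cleanly across keys.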
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since it doesn't sound like you need joins, you should be fine. You can set up indexes to find the data set and then do the math either in SQL or in the app.