How to append separate datasets to make a combined stacked dataset in Stata without losing information - merge

I'm trying to merge two datasets from two time periods, time 1 & 2, to make a combined repeated measures dataset. There are some observations in time 1 which do not appear in time 2, as the observations are for participants who dropped out after time 1.
When I use the append command in Stata, it appears to drop the observations from time 1 that don't have corresponding data at time 2. It does, however, append observations for new participants who joined at time 2.
I would like to keep the time 1 data of those participants who dropped out, so that I can still use that information in the combined dataset.
How can I tell Stata not to automatically drop these participants?
Thanks,
Steve

Perhaps the best way of interesting people in advising you on your problems is to respect those answer your questions. You have been repeatedly advised, even as recently as yesterday, to review https://stackoverflow.com/help/someone-answers and provide the feedback that reflects itself in the reputation scores of those who take the time to help you.
In any event, append does not work as you describe it. If you take the time to work out a small reproducible example, by creating two small datasets, appending them, and then listing the results, you may find the roots of your misunderstanding. Or you will at least be able to provide substantive information from which others can work, should someone be interested in helping you.

Related

How do I efficiently execute large queries?

Consider the following demo schema
trades:([]symbol:`$();ccy:`$();arrivalTime:`datetime$();tradeDate:`date$(); price:`float$();nominal:`float$());
marketPrices:([]sym:`$();dateTime:`datetime$();price:`float$());
usdRates:([]currency$();dateTime:`datetime$();fxRate:`float$());
I want to write a query that gets the price, translated into USD, at the soonest possible time after arrivalTime. My beginner way of doing this has been to create intermediate tables that do some filtering and translating column names to be consistent and then using aj and ajo to join them up.
In this case there would only be 2 intermediate tables. In my actual case there are necessarily 7 intermediate tables and records counts, while not large by KDB standards, are not small either.
What is considered best practice for queries like this? It seems to me that creating all these intermediate tables is resource hungry. An alternative to the intermediate tables is 2 have a very complicated looking single query. Would that actually help things? Or is this consumption of resources just the price to pay?
For joining to the next closest time after an event take a look at this question:
KDB reverse asof join (aj) ie on next quote instead of previous one
Assuming that's what your looking for then you should be able to perform your price calculation either before or after the join (depending on the size of your tables it may be faster to do it after). Ultimately I think you will need two (potentially modified as per above) aj's (rates to marketdata, marketdata to trades).
If that's not what you're looking for then I could give some more specifics although some sample data would be useful.
My thoughts:
The more verbose/readible your code, the better for you to debug later and any future readers/users of your code.
Unless absolutely necessary, I would try and avoid creating 7 copies of the same table. If you are dealing with large tables memory could quickly become a concern. Particularly if the processing takes a long time, you could be creating large memory spikes. I try to keep to updating 1-2 variables at different stages e.g.:
res: select from trades;
res:aj[`ccy`arrivalTime;
res;
select ccy:currency, arrivalTime:dateTime, fxRate from usdRates
]
res:update someFunc fxRate from res;
Sean beat me to it, but aj for a time after/ reverse aj is relatively straight forward by switching bin to binr in the k code. See the suggested answer.
I'm not sure why you need 7 intermediary tables unless you are possibly calculating cross rates? In this case I would typically join ccy1 and ccy2 with 2 ajs to the same table and take it from there.
Although it may be unavoidable in your case if you have no control over the source data, similar column names / greater consistency across schemas is generally better. e.g. sym vs symbol

What is the "Tableau" way to deal with changing data?

As a background to this question: I've been using Tableau for some time now, but I've been using code (Python, Swift, etc) as a crutch for getting some of the more complicated things done. My employer is now making me move what I can away from custom code and into retail software packages because it will make things easier to maintain if I get hit by a bus or something.
The scenario: With code, I find it very easy to deal with constantly changing/growing data by using recursion. I know that this isn't something I can do with Tableau, but I've found that for many problems so far there is a "Tableau way" of thinking/doing that can solve a lot of problems. And, I'm not allowed to use Rserve/TabPy.
I have a batch of transactional data that grows every month by about 1.6mil records. What I would like to do is build something in Tableau that can let me track a complicated rolling total across the data without having to do it manually. In my code of choice, it would have been something like:
Import the data into a frame
For every unique date value in the 'transaction date' field, create a new column with that name
Total the number of transaction in each account for that day
Write the data to the applicable column
Move on to the next day
Then create new columns that store the sum total of transactions for that account over all of the 30 day periods available (date through date + 29 days)
Select the max value of the accounts for a customer for those 30-day sums
Dump all of that 30-day data into a new table based on the customer identifier
It's a lot of steps, but with a couple of nice recursive functions, it's done in a snap with a bit of code. Plus, it can handle the data as it changes.
The actual question: How in the world do I approach problems like this within Tableau since my brain goes straight to recursive function land? I can do this manually with Tableau Prep, but it takes manual tweaking every time the data changes. Is there a better way, or is this just not within the realm of what Tableau really does?
*** Edit 10/1/2020: Minor typo fix. ***

What is the best way to store data where one column has values that repeat ranging anywhere from 1-300+ times?

I've used web scraping to grab approximately 10,000 movies and all their associated review pages URLs, and the next step for me is to grab every single one of those reviews so that I can get the overall positive/negative reviews using sentiment analysis.
I'm writing all this in Python and am using the Pandas library as my means of pre-processing and structuring all the data. Already I have around 36,000 rows containing the name of the movie in one column and the URLs in the other, with the movie name being repeated over and over again, and with the average reviews per page being 20 I'm looking at roughly 720,000 rows when all things are said and done.
This is for the final project of the college course I'm taking, and throughout my schooling I've come to fear data redundancy in databases. I will eventually be writing all of this to a PostgreSQL database so users can query any movie to get back the prediction, and I'm having a hard time overlooking the fact that these movie titles are being repeated so often.
I was wondering if there was a better way to go about this (which could also hopefully save me some processing time), any help would be greatly appreciated!
I feel like this is more of a direct question than a code issue, but if necessary I can provide any relevant code.
If all the information you have about each movie, there is no redundancy (in the relational sense) , since this is the unique identifier.
You could save some space by having a separate movie table that contains an artificial numeric ID and the name and reference the ID from the main table, but that will make your queries more complicated and seems unnecessary for a small table like this.
What I would be more concerned about is whether the movie name is a good identifier at all: what if two movies have the same name? In this age of remakes, that is not a rarity.

Get .df schema timings

I am running a .df file on tables of varying size and would like to gather timings for how long each item (add column, add unique index, etc.) in a .df takes to apply on each of these tables. One approach I have used is to partition the delta by individual items and run them separately, but this as a whole takes longer than running them as one .df and I would like the most accurate timings possible. Is there a feature in openedge such as adding extra logging, outputting results, etc. that would allow me to gather timings on how long these items take to run?
If you extract load_df.p from prodict (Here's the KB http://knowledgebase.progress.com/articles/Article/15884),
you can customize to add your own logging, so it will suit your needs. I'm pretty sure it would be simple to code.
But answering your original question, I can't think of any default extra logging option to give you that kind of information.
If you start your session you use to load you df with
-clientlog "c:\tmp\mylog.log" -logentrytypes "4GLTrace" -logginglevel 2
then you should get a trace through the code along with time stamps and so on. I've no idea if it will do what you want, but it's worth a try!

Calculating and reporting Data Completeness

I have been working with measuring the data completeness and creating actionable reports for out HRIS system for some time.
Until now i have used Excel, but now that the requirements for reporting has stabilized and the need for quicker response time has increased i want to move the work to another level. At the same time i also wish there to be more detailed options for distinguishing between different units.
As an example I am looking at missing fields. So for each employee in every company I simply want to count how many fields are missing.
For other fields I am looking to validate data - like birthdays compared to hiring dates, threshold for different values, employee groups compared to responsibility level, and so on.
My question is where to move from here. Is there any language that is better than any of the others when dealing with importing lists, doing evaluations on fields in the lists and then quantify it on company and other levels? I want to be able to extract data from our different systems, then have a program do all calculations and summarize the findings in some way. (I consider it to be a good learning experience.)
I've done something like this in the past and sort of cheated. I wrote a program that ran nightly, identified missing fields (not required but necessary for data integrity) and dumped those to an incomplete record table that was cleared each night before the process ran. I then sent batch emails to each of the different groups responsible for the missing element(s) to the responsible group (Payroll/Benefits/Compensation/HR Admin) so the missing data could be added. I used .Net against and Oracle database and sent emails via Lotus Notes, but a similar design should work on just about any environment.