Note that MR stands for “most recent”:
PHA_NAME,
MR_INSPECTION_DATE,
MR_INSPECTION_COST,
SECOND_MR_INSPECTION_DATE,
SECOND_MR_INSPECTION_COST,
CHANGE_IN_COST,
PERCENT_CHANGE_IN_COST
Management has asked that you perform this task using the LEAD or LAG window functions in SQL.
However, they’re concerned that when the files are imported into MySQL Workbench, the dates may not be stored in the correct format. If that is the case, they’ve asked you to investigate how best to convert the dates from TEXT to DATE format so that the LEAD/LAG functions work as expected.
They’ve also asked that you filter your dataset to only those PHAs that saw an increase in cost, and that you list each PHA only once, with no duplicates, to avoid noisy data.
Naturally, this also requires filtering out PHAs that performed only one inspection, so they’ve asked you to remove those as well.
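A minimal sketch of one way this could look in MySQL 8+, assuming a source table named inspections with columns pha_name, inspection_date (TEXT, here assumed to be in MM/DD/YYYY form) and inspection_cost; the table name, column names, and date format are all assumptions, not part of the brief:

    WITH ordered AS (
        SELECT
            pha_name,
            STR_TO_DATE(inspection_date, '%m/%d/%Y') AS insp_date,  -- TEXT -> DATE
            inspection_cost,
            LAG(STR_TO_DATE(inspection_date, '%m/%d/%Y')) OVER w AS prev_date,
            LAG(inspection_cost)                          OVER w AS prev_cost,
            ROW_NUMBER() OVER (PARTITION BY pha_name
                               ORDER BY STR_TO_DATE(inspection_date, '%m/%d/%Y') DESC) AS rn
        FROM inspections
        WINDOW w AS (PARTITION BY pha_name
                     ORDER BY STR_TO_DATE(inspection_date, '%m/%d/%Y'))
    )
    SELECT
        pha_name                    AS PHA_NAME,
        insp_date                   AS MR_INSPECTION_DATE,
        inspection_cost             AS MR_INSPECTION_COST,
        prev_date                   AS SECOND_MR_INSPECTION_DATE,
        prev_cost                   AS SECOND_MR_INSPECTION_COST,
        inspection_cost - prev_cost AS CHANGE_IN_COST,
        (inspection_cost - prev_cost) / prev_cost * 100 AS PERCENT_CHANGE_IN_COST
    FROM ordered
    WHERE rn = 1                        -- one row per PHA: its most recent inspection
      AND prev_cost IS NOT NULL         -- drops PHAs with only one inspection
      AND inspection_cost > prev_cost;  -- keeps only PHAs whose cost increased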
Related
I've been working on a module that works well with MySQL, but when I run the unit tests under PostgreSQL (using Travis), I get an error.
The module itself is here: https://github.com/silvercommerce/taxable-currency
An example failed build is here: https://travis-ci.org/silvercommerce/taxable-currency/jobs/546838724
I don't have a huge amount of experience with PostgreSQL, so I am not really sure why this might be happening. The only thing I can think of is that I am manually setting the IDs in my fixtures file, and maybe PostgreSQL doesn't support this?
If this is not the case, does anyone have an idea what might be causing this issue?
Edit: I have looked into this again and the errors appear to be caused by this assertion, which should find the "vat" Tax Rate but instead finds the "reduced" Tax Rate.
I am guessing there is an issue in my logic that is causing the incorrect rate to be returned, though I am unsure why...
In the end it appears that Postgres has a different default sort order from MySQL (https://www.postgresql.org/docs/9.1/queries-order.html). The line of interest is:
The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on
In the end I didn't really need to test a list with multiple items, so instead I just removed the additional items.
If you are working on something that needs to support MySQL and Postgres though, you might need to consider defining a consistent sort order as part of your query.
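For example, a short illustrative query (the table and column names here are made up, not taken from the module):

    -- An explicit ORDER BY with a unique tie-breaker (id) returns rows in the
    -- same order on both MySQL and Postgres, regardless of on-disk layout.
    SELECT id, title, rate
    FROM   tax_rate
    ORDER  BY rate DESC, id;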
I'm building a data visualization system for Forex trading and I'm exploring ways of storing the historical Forex trading tick data that I have.
The data are chronological ticks of Ask and Bid prices per currency pair (e.g. USD/CAD). At the end of the day I need my data to be indexed in Elasticsearch, and what I'm searching for is the best way to get them there.
I found a couple of approaches online; they start out simple but then get complicated. I'm wondering if adding that extra complexity is worth it. Some of my options are:
Storing tick data in PostgreSQL and then syncing it to Elasticsearch via a plugin (here)
Storing tick data in PostgreSQL, pushing it to Logstash and then to Elasticsearch
Finally, storing tick data in PostgreSQL, pushing it to Redis, then to Logstash, and then to Elasticsearch
My intuition says that solution No 2 would be the ideal one, but what is considered best practice?
It's a good idea to store your data in a long-term storage DB, such as PostgreSQL or similar. That way you can decide at any time whether you need to change your mappings, add fields, remove fields, change their types, or what have you, and then you can easily rebuild your ES index/indices without too much trouble from your primary source of truth (i.e. PostgreSQL) and you always have clean data in ES.
I don't know ZomboDB (solution 1), so I can't really speak for it; all I know is that I'm generally not too fond of tying two different technologies together, because it makes it hard to upgrade either of them when you need or want to apply patches or benefit from new features.
Unless you have big and costly transformations to do on your source data, I feel that solution 3 doesn't bring much: the additional step of storing data in an intermediary Redis doesn't add much value in my opinion (your mileage may vary here). A temporary store such as Redis or Kafka is a good idea when you risk losing data along the pipeline, but in this case, since you have your data in PostgreSQL, you don't really run the risk of losing anything; at worst, you can relaunch your pipeline and rebuild a few days of data.
That leaves solution 2, which would be fine given the information at hand. Using the Logstash JDBC input, you can easily retrieve the latest changes and forward them to ES every x minutes.
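As a rough sketch of the kind of statement such a pipeline could run on a schedule (the table and column names are assumptions; :sql_last_value is the tracking placeholder the Logstash JDBC input fills in between runs):

    -- Pull only rows added since the last run, using the primary key as the tracking column.
    SELECT id, pair, ask, bid, tick_time
    FROM   ticks
    WHERE  id > :sql_last_value
    ORDER  BY id;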
Eric from ZomboDB here. I wanted to try and answer your question as it relates to ZDB.
ZomboDB is really designed for full-text searching within Postgres. It's important to note that it's not a tool to synchronize your PG data to Elasticsearch. It's a fully-functional Postgres index type (akin to the built-in types like btree, gin, and gist) that happens to be backed by Elasticsearch. The fact that ZomboDB uses Elasticsearch is really an implementation detail.
While ZDB does provide a number of UDFs that expose access to ES' aggregate facilities, again, it's really designed for text searching.
So if your data is really just pairs of numbers, you're probably better off using ES directly -- especially if you're loading in one batch per day. There's no doubt that ZDB could provide superior aggregate performance compared to standard Postgres "GROUP BY" queries (because it passes it through to Elasticsearch), but you're paying a heavy operational penalty for a limited use-case.
If, on the other hand, your ask/bid data comes with a lot of related metadata, and:
You need PG to be your source of truth,
You need to text-search that metadata (with or without aggregation support), and
You don't want to learn ES and introduce another database system to your application, then...
... ZomboDB could be right for you.
I suspect Stack Overflow isn't the place to get into this, so feel free to contact me via the ways ZDB's github page recommends.
I have a couple of million entries in a table with start and end timestamps. I want to implement an analysis tool which determines unique entries for a specific interval, say between yesterday and two months before yesterday.
Depending on the interval, the queries take between a couple of seconds and 30 minutes. How would I implement an analysis tool for a web front-end that allows this data to be queried quickly, similar to Google Analytics?
I was thinking of moving the data into Redis and doing something clever with intervals and sorted sets, but I was wondering if there's something in PostgreSQL that would allow me to execute aggregated queries and re-use earlier results, so that, for instance, after querying the first couple of days it does not start from scratch when looking at a different interval.
If not, what should I do? Export the data to something like Apache Spark or DynamoDB, do the analysis there, and fill Redis for quicker retrieval?
Either will do.
Aggregation is a basic task they all can do, and your data is small enough to fit into main memory, so you don't even need a database (though the aggregation functions of a database may still be better implemented than anything you rewrite, and SQL is quite convenient to use).
Just do it. Give it a try.
P.S. Make sure to enable data indexing and choose the right data types. Maybe check query plans, too.
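As a hedged illustration of the plain-PostgreSQL route (table and column names are assumptions), an index on the timestamp columns plus a straightforward overlap query is often all that data volumes of this size need:

    -- Index the interval bounds so range predicates don't scan the whole table.
    CREATE INDEX idx_entries_start_end ON entries (start_ts, end_ts);

    -- Unique entries whose [start_ts, end_ts] range overlaps the requested interval;
    -- the two date literals stand in for the bounds supplied by the front-end.
    SELECT COUNT(DISTINCT entry_id)
    FROM   entries
    WHERE  start_ts <= DATE '2024-02-28'   -- interval end
      AND  end_ts   >= DATE '2023-12-29';  -- interval start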
I have been working on measuring data completeness and creating actionable reports for our HRIS system for some time.
Until now I have used Excel, but now that the reporting requirements have stabilized and the need for quicker response times has increased, I want to take the work to another level. At the same time, I would also like more detailed options for distinguishing between different units.
As an example I am looking at missing fields. So for each employee in every company I simply want to count how many fields are missing.
For other fields I am looking to validate data, like birthdays compared to hiring dates, thresholds for different values, employee groups compared to responsibility level, and so on.
My question is where to go from here. Is there any language better suited than the others for importing lists, evaluating fields in those lists, and then quantifying the results at the company level and other levels? I want to be able to extract data from our different systems, then have a program do all the calculations and summarize the findings in some way. (I consider it to be a good learning experience.)
I've done something like this in the past and sort of cheated. I wrote a program that ran nightly, identified missing fields (not required, but necessary for data integrity) and dumped those into an incomplete-record table that was cleared each night before the process ran. I then sent batch emails to the group responsible for each missing element (Payroll/Benefits/Compensation/HR Admin) so the missing data could be added. I used .NET against an Oracle database and sent emails via Lotus Notes, but a similar design should work in just about any environment.
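A hedged sketch of that idea in plain SQL (all table and column names below are invented for illustration): rebuild the exception table each night, then summarize it per company for the report or the batch emails.

    -- Clear last night's snapshot.
    TRUNCATE TABLE incomplete_record;

    -- One row per employee per problem found.
    INSERT INTO incomplete_record (employee_id, company, problem)
    SELECT employee_id, company, 'missing birth_date'
    FROM   employee
    WHERE  birth_date IS NULL
    UNION ALL
    SELECT employee_id, company, 'missing hire_date'
    FROM   employee
    WHERE  hire_date IS NULL
    UNION ALL
    SELECT employee_id, company, 'hire_date earlier than birth_date'
    FROM   employee
    WHERE  hire_date IS NOT NULL
      AND  birth_date IS NOT NULL
      AND  hire_date < birth_date;

    -- Completeness summary per company.
    SELECT company, problem, COUNT(*) AS occurrences
    FROM   incomplete_record
    GROUP  BY company, problem
    ORDER  BY company, occurrences DESC;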
I'm rewriting some SQL code that generates a "bucket" column based on the current month (that happens to be stored as VARCHAR, ugh).
I noticed that the previous author decided to store a certain number of dates for comparison as variables, and then use a case statement to calculate the bucket.
See the SQL Fiddle here, which demonstrates a slice of this (the pattern appears about five times, in three slightly different forms similar to this one).
Is there any reason I cannot simply get rid of most of the redundant variables and simplify it to a few lines of code (see this Fiddle)? Is the overhead of in-line function calls great enough to justify doing this, or does the query compiler cache the results of that portion anyway?
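Without the fiddle contents reproduced here, only a generic sketch is possible. Assuming SQL Server-style syntax and a month column stored as a 'YYYYMM' VARCHAR (both assumptions), the inlined version of such a bucket calculation tends to look like this:

    -- Table, column, bucket labels, and month thresholds below are placeholders.
    DECLARE @first_of_month date = DATEFROMPARTS(YEAR(GETDATE()), MONTH(GETDATE()), 1);

    SELECT t.*,
           CASE
               WHEN TRY_CONVERT(date, t.month_col + '01', 112)
                        >= DATEADD(month, -1, @first_of_month) THEN 'Current/Prior Month'
               WHEN TRY_CONVERT(date, t.month_col + '01', 112)
                        >= DATEADD(month, -3, @first_of_month) THEN '2-3 Months Ago'
               ELSE 'Older'
           END AS bucket
    FROM dbo.some_table AS t;

As for the overhead question: whether the engine folds or recomputes scalar expressions like these is engine- and version-specific, so the honest answer is to compare the execution plans of the two versions; in most cases the cost of a few date functions per row is negligible next to the I/O of reading the rows.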