Condense long CASE statement into function for approximate string match - data-cleaning

I have a table with many seemingly duplicated company entities (tblOrganisations), for example:
ID  Name
1   Company 1
2   Company One
3   CompanyOne
4   Company One
5   Company One (Pty)Ltd
6   Company 1(Pty) Ltd
7   Company One Pty Ltd
8   Business 1
9   Business One
10  BusinessOne
11  Business One
12  Business One (Pty)Ltd
13  Business 1(Pty) Ltd
14  Business One Pty Ltd
There are many more companies with many different variations in this table (~100k rows).
I have composed an approximate string match using a case statement which cleanses these entities as per the below:
CASE WHEN tblOrganisations.Name LIKE 'Company%' THEN 'Company One (Pty) Ltd'
     WHEN tblOrganisations.Name LIKE 'Business%' THEN 'Business One (Pty) Ltd'
     -- etc.
END
Typically I use this CASE statement in my SELECTs; however, it can become very cumbersome with all the different variations I have to account for. (Note that my actual CASE statement is far more complex than this and corrects enough variation to get me to my target accuracy.)
My question is:
How do I go about creating a function that I can call in my stored procedure in order to maximise the lifespan of my mousewheel? 😏
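A minimal sketch of one way to do that in T-SQL, assuming SQL Server (the function name dbo.CleanOrganisationName is illustrative, not from the original post): wrap the CASE in a scalar user-defined function, so the mapping lives in exactly one place.

CREATE FUNCTION dbo.CleanOrganisationName (@Name NVARCHAR(255))
RETURNS NVARCHAR(255)
AS
BEGIN
    -- Same CASE as before, now maintained once rather than pasted into every SELECT
    RETURN CASE
        WHEN @Name LIKE 'Company%'  THEN 'Company One (Pty) Ltd'
        WHEN @Name LIKE 'Business%' THEN 'Business One (Pty) Ltd'
        -- ... the rest of the variations ...
        ELSE @Name  -- pass unmatched names through unchanged
    END;
END;

Then in a stored procedure or ad hoc query:

SELECT dbo.CleanOrganisationName(Name) AS CleanName
FROM tblOrganisations;

One caveat: scalar UDFs are evaluated row by row, which can be slow over ~100k rows; if that becomes a problem, the same CASE could live in an inline table-valued function or a lookup/mapping table instead.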

Related

I need help with a data sanitization problem in Tableau

I tried doing the sanitization manually; however, I am getting a type mismatch error when performing the calculations.
I also need help sanitizing the data and getting insights as per the instructions below:
The column sellerproductcount gives you the count of products in the form '1-16 of over 100,000 results', and you can parse out the product count 100,000.
sellerratings - this column gives you the % and count of positive ratings (e.g. '88% positive in the last 12 months (118 ratings)') if parsed correctly.
sellerdetails - you can use this text to parse out phone numbers and email IDs of merchants, where available, so our team can reach out to them.
businessaddress - this will give you the business locations of the sellers. You can parse them to identify whether a seller is registered in the US, Germany (DE), or China (CN).
Hero Product 1 #ratings and Hero Product 2 #ratings - these 2 columns give you the number of ratings of the 2 'hero products', or bestselling products, of this seller.
I have attached the dataset for the same.
https://docs.google.com/spreadsheets/d/1PSqRCnmFgq7v7RzZaCXXoV0Edp_vM7QO/edit?usp=sharing&ouid=115547990006782902200&rtpof=true&sd=true
Most of this type of data prep can be done with string & regex functions like REGEXP_MATCH(). Here are a few examples based on the data you shared:
Seller Product Count
INT(REPLACE(REGEXP_EXTRACT([Sellerproductcount], '(\d*,?\d*) results'), ',', ''))
1-16 of over 6,000 results >> 6000
Seller Rating (Percentage)
INT(REGEXP_EXTRACT([Sellerratings], '(\d*)% positive'))
92% positive in the last 12 months (181 ratings) >> 92
Seller Rating (Count)
INT(REGEXP_EXTRACT([Sellerratings], '(\d*) (?:total )?ratings'))
92% positive in the last 12 months (181 ratings) >> 181
Business Country Code
RIGHT([Businessaddress],2)
AM Treptower Park28-30Berlin12435DE >> DE
These examples all have very straightforward patterns that are present in all rows, so they can be handled with one simple calculation each. However, something like sellerdetails, which is unstructured, inconsistent, and sometimes incomplete, will be more of a challenge. You will need to combine a couple of different calculations and techniques to find what you are looking for, as well as some manual data prep. Here's an example of how you can pull out email addresses, though it won't work for everything:
Email
REGEXP_EXTRACT([Sellerdetails], "([a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)")
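Phone numbers in sellerdetails are similarly inconsistent; a rough, hypothetical starting pattern (it will miss some formats and catch some non-phone digit runs, so expect to refine it against your data) could be:
Phone
REGEXP_EXTRACT([Sellerdetails], '(\+?[0-9][0-9 ()/.-]{7,}[0-9])')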
Good luck with your data cleaning. I suggest using sites like https://regex101.com/ and https://regexr.com/ to learn more about regular expressions and to help test them.

Design Database schema for billing system

I want to design a database for a billing system. In one bill a customer might have purchased multiple different items; for example, for bill ID 1 a customer purchased 2 apples, 3 bananas, and 1 watermelon. I want to know how I can normalize this database.
This is a pretty standard, basic normalization exercise with a pretty standard solution. The usual approach is to have an orders table containing order ID, customer ID, order date, etc., and an order_items table with a record for each line item on the order.
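A minimal sketch of that shape in SQL (table and column names are illustrative):

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL,
    order_date  DATE NOT NULL
);

CREATE TABLE order_items (
    order_id   INT NOT NULL REFERENCES orders (order_id),
    product_id INT NOT NULL,  -- FK to a products table (apple, banana, watermelon, ...)
    quantity   INT NOT NULL,
    PRIMARY KEY (order_id, product_id)
);

Bill 1 then becomes one row in orders plus three rows in order_items (apple/2, banana/3, watermelon/1), and adding new item types never requires a schema change.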

Rails 4 + PostgreSQL (Heroku): complex/scalable queries for analysis/reporting on data

Working on a financial application that tracks sales. However, I'm running into problems trying to create a schema for properly tracking the data for reports (the main point of the app).
A purchase is the foundation of the app. It has several associations (listed below). Each purchase is tracked via a year and month field. A year is the smallest unit a user may filter a report by, so I will only have to show data for each month in that year.
# purchase.rb model
class Purchase < ActiveRecord::Base
# Associations:
# belongs_to :partner
# belongs_to :purchase_type
# belongs_to :purchase_category
# Attributes:
# partner_id => association
# purchase_type_id => association
# purchase_category_id => association
# year => year in integer (2013, 2014, etc...)
# month => month in integer ("January" => 1, etc...)
# amount => amount a product sold for in cents ($10.00 => 1000)
# fee => fee for associated partner (if there is one) in cents ($2.00 => 200)
end
The problem is that I need to show an overview for a given year, which breaks things down by how many purchases were completed, which partners completed them, and what the fee amounts were. I solved that by having YearMetric and MonthMetric tables that are updated every time a purchase is added/updated/removed. So you add a new purchase for a given year/month, and the corresponding YearMetric and MonthMetric rows are found and updated with +/- the appropriate amounts/fees.
This solution works well for the overview page. However, I also need to be able to view purchases in the context of partners, purchase_types, and purchase_categories. If I followed the same strategy as my overview report, I would have to add the following tables:
PartnerYearMetric, PartnerMonthMetric
PurchaseCategoryYearMetric, PurchaseCategoryMonthMetric
PurchaseTypeYearMetric, PurchaseTypeMonthMetric
So every time I add a purchase, I would be doing up to 8 additional DB updates (8 finds and then 8 updates).
The items I'm reporting on are total purchases made, average purchases (historical comparison), total amounts/fees for the period, top partners by number of purchases and by most fee amounts, etc...
There has to be a better solution than this. "Live calculation" by updating 8 records for every 1 purchase seems a bit overkill.
What you're doing is maintaining materialized views of the data in the application. It's a form of denormalization. That can be OK as an optimization but should not be your first choice. It can be very error prone, especially in the presence of concurrency, and must be done quite carefully.
Instead, when you wish to generate a summary report, use aggregates to SUM them, COUNT them, etc. as appropriate. See aggregate functions in the Pg docs, Rails Calculations, and Rails aggregates.
You may find it convenient to create a VIEW over the query you use, and then access the view from the application.
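For instance, a sketch of such a view for the per-partner monthly breakdown, using the columns from the model above (the purchases table name is an assumption based on Rails naming conventions):

CREATE VIEW partner_month_summary AS
SELECT year,
       month,
       partner_id,
       COUNT(*)    AS purchase_count,
       SUM(amount) AS total_amount_cents,
       SUM(fee)    AS total_fee_cents
FROM purchases
GROUP BY year, month, partner_id;

-- The overview for one year is then a single read:
-- SELECT * FROM partner_month_summary WHERE year = 2014 ORDER BY month;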
If you find performance of calculating the aggregates in real time for the summary to be a problem, and you cannot solve it with proper indexing and tuning, then you should think about denormalizing. Rather than maintaining your materialized views in the app, though, consider using triggers in the database; they're much easier to write in a concurrency-safe way.
You may also want to look up PostgreSQL 9.4's enhanced materialized views support.

Group by "original order" causes looping in crystal report 11

I have a report grouped by Themes-S >> Questions-S. There are 8 themes, and each theme has between 5 and 17 questions.
The report has 16 pages.
I need to change the ordering from specific to original, but when I do I end up with 288 pages.
Is something looping? I cannot figure out how to fix this.
(Using CR 11.)
You just might have a very unoptimized original order, with page break properties set on start/end of group. For example, if your database stores records for 'country' in this order:
Canada
Canada
USA
Canada
USA
Canada
USA
Then with specific order "USA", "Canada", you'd have only 2 groups. With original order, however, you'd have 6 groups. Since the group is changing on (almost) every record, it might seem like it's "looping" over the values, repeating them again.
If you don't want it to do this, you can either (a) not use original order, or (b) change your source data to be better organized.
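If the report's record source is a query you control, option (b) can be as simple as sorting before Crystal sees the data, so equal values arrive contiguously and original order yields one group per value. A sketch with illustrative table and column names:

SELECT country, sales_amount
FROM tblSales
ORDER BY country;  -- contiguous countries => 2 groups instead of 6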

Database design challenge

I'm creating a virtual stamp card program for the iPhone and have run into an issue with implementing my database. The program essentially has a main points system that can be utilized across all merchants (sort of like Air Miles), but I also want to keep track of how many times you've been to EACH merchant.
So far, I have created 3 main tables for users, merchants, and transactions.
1) Users table contains basic info like user_id and total points collected.
2) Merchants table contains info like merchant_id, location, total points given.
3) Transactions table simply creates a new row for every time someone checks into each merchant, and records date-stamp, user name, merchant name, and points awarded.
So the most basic way to find out how many times you've been to each merchant is to query the entire transactions table for both user and merchant, which gives me a transaction history for that specific merchant (which is perfect), but in the long run I feel this will be horrible for performance.
The other straightforward, yet "dumb", method would be to create a column in the users table for EACH merchant and keep the running totals there. This seems inappropriate, as I will be adding new merchants on a regular basis, and new columns would need to be added to every user each time that happens.
I've looked into one-to-many and many-to-many relationships for MySQL databases, but can't seem to come up with anything concrete, as I'm extremely new to web/PHP/MySQL development, but I'm guessing this is what I'm looking for...
I've also thought of creating a special transaction table for each user, with a column for merchant and another for the # of times visited. Again, I'm not sure this is the most efficient implementation.
Can someone point me in the right direction?
You're doing the right thing in the sense of thinking up the different options, and weighing up the good and bad for each.
Personally, I'd go with a MerchantCounter table which joins on your Merchant table by id_merchant (for example) and which you keep up-to-date explicitly.
Over time it does not get slower (unlike an activity-search), and does not take up lots of space.
Edit: based on your comment, Janan, no I would use a single MerchantCounter table. So you've got your Merchant table:
id_merchant  nm_merchant
12           Jim
15           Tom
17           Wilbur
You would add a single additional table, MerchantCounter (edited to show how to tally totals for individual users):
id_merchant  id_user  num_visits
12           101      3
12           102      8
15           101      6007
17           102      88
17           104      19
17           105      1
You can see how id_merchant links the table to the Merchant table, and id_user links to a further User table.
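A sketch of the explicit upkeep in MySQL, since that's the stack mentioned (table and column names match the examples above; the composite primary key on (id_merchant, id_user) is what makes the upsert work):

CREATE TABLE MerchantCounter (
    id_merchant INT NOT NULL,
    id_user     INT NOT NULL,
    num_visits  INT NOT NULL DEFAULT 0,
    PRIMARY KEY (id_merchant, id_user)
);

-- Run once per check-in: inserts on the first visit, increments afterwards
INSERT INTO MerchantCounter (id_merchant, id_user, num_visits)
VALUES (17, 102, 1)
ON DUPLICATE KEY UPDATE num_visits = num_visits + 1;

-- "How many times has user 102 visited merchant 17?"
SELECT num_visits FROM MerchantCounter WHERE id_merchant = 17 AND id_user = 102;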