I tried doing the manual sanitization, but I am getting a type mismatch error when performing the calculations.
I also need help sanitizing the data and extracting insights as per the instructions below:
The column sellerproductcount gives you the count of products in the
form '1-16 of over 100,000 results', and you can parse out the product count 100,000.
sellerratings - this column gives you the percentage and count of positive ratings (e.g. '88% positive in the last 12 months (118 ratings)') if parsed correctly.
sellerdetails - you can use this text to parse out phone numbers and email IDs of merchants, where available, so our team can reach out to them.
businessaddress - this will give you the business locations of the sellers. You can parse them to identify whether a seller is registered in the US, Germany (DE), or China (CN).
Hero Product 1 #ratings and Hero Product 2 #ratings - these two columns give you the number of ratings of the seller's two 'hero products', i.e. their bestselling products.
I have attached the dataset for the same.
https://docs.google.com/spreadsheets/d/1PSqRCnmFgq7v7RzZaCXXoV0Edp_vM7QO/edit?usp=sharing&ouid=115547990006782902200&rtpof=true&sd=true
Most of this type of data prep can be done with string and RegEx functions like REGEXP_MATCH(). Here are a few examples based on the data you shared:
Seller Product Count
INT(REPLACE(REGEXP_EXTRACT([Sellerproductcount], '([\d,]+) results'), ',', ''))
1-16 of over 6,000 results >> 6000
Seller Rating (Percentage)
INT(REGEXP_EXTRACT([Sellerratings], '(\d*)% positive'))
92% positive in the last 12 months (181 ratings) >> 92
Seller Rating (Count)
INT(REGEXP_EXTRACT([Sellerratings], '(\d*) (?:total )?ratings'))
92% positive in the last 12 months (181 ratings) >> 181
Business Country Code
RIGHT([Businessaddress],2)
AM Treptower Park28-30Berlin12435DE >> DE
These examples all have very straightforward patterns that are present in all rows, so they can be handled easily with one simple calculation. However, something like sellerdetails, which is unstructured, inconsistent, and sometimes incomplete, will be more of a challenge. You will need to combine a couple of different calculations and techniques, along with some manual data prep, to find what you are looking for. Here's an example of how you can pull out emails, but it won't work for everything:
Email
REGEXP_EXTRACT([Sellerdetails], "([a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)")
Good luck with your data cleaning. I suggest using sites like https://regex101.com/ and https://regexr.com/ to learn more about regular expressions and to help test them.
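If you want to prototype these patterns outside Tableau first, here is a small Python sketch of the same extractions (a rough equivalent, not the Tableau calcs themselves; the helper names and sample strings are just for illustration):

import re

def product_count(text):
    # '1-16 of over 6,000 results' -> 6000
    m = re.search(r'([\d,]+) results', text)
    return int(m.group(1).replace(',', '')) if m else None

def rating_percent(text):
    # '92% positive in the last 12 months (181 ratings)' -> 92
    m = re.search(r'(\d+)% positive', text)
    return int(m.group(1)) if m else None

def rating_count(text):
    # '92% positive in the last 12 months (181 ratings)' -> 181
    m = re.search(r'(\d+) (?:total )?ratings', text)
    return int(m.group(1)) if m else None

def country_code(address):
    # 'AM Treptower Park28-30Berlin12435DE' -> 'DE'
    return address[-2:] if address else None

print(product_count('1-16 of over 6,000 results'))                         # 6000
print(rating_percent('92% positive in the last 12 months (181 ratings)'))  # 92
print(rating_count('92% positive in the last 12 months (181 ratings)'))    # 181
print(country_code('AM Treptower Park28-30Berlin12435DE'))                 # DE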
I’m trying to create a table with an aggregation of values in the same field but without any calculation function.
I have a Loki LogQL query that transforms to a table with three labels. I want to do a “group by” with two of the labels, and the third one will be aggregated to have all the values in the same line together. For example:
The log lines from the query are:
Apple Buy 20
Apple Buy 20
Apple Sell 45
Apple Sell 45
Banana Buy 30
Banana Buy 30
Banana Buy 20
Banana Buy 20
Banana Sell 45
Banana Sell 45
And after transformations (Labels to fields, Filter by name, Group by - on all three labels, Organize fields), the table looks like this:
Type    Process  Value
Apple   Buy      20
Apple   Sell     45
Banana  Buy      30
Banana  Buy      20
Banana  Sell     45
And I want it to become like this:
Type    Process  Value
Apple   Buy      20
Apple   Sell     45
Banana  Buy      30, 20
Banana  Sell     45
So for Type Banana with Process Buy, all the values are aggregated together like in a list or vector (the values can be ordered, but it's not necessary; each Value is a number stored as a string).
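To make the target shape concrete, here is the same aggregation sketched in pandas (purely to illustrate the output I want; I know the actual transformation has to happen in Grafana):

import pandas as pd

# The deduplicated rows from the first table.
df = pd.DataFrame({
    'Type':    ['Apple', 'Apple', 'Banana', 'Banana', 'Banana'],
    'Process': ['Buy',   'Sell',  'Buy',    'Buy',    'Sell'],
    'Value':   ['20',    '45',    '30',     '20',     '45'],
})

# Group by Type and Process; join the Values of each group into one string.
result = df.groupby(['Type', 'Process'], as_index=False)['Value'].agg(', '.join)
print(result)
#      Type Process   Value
# 0   Apple     Buy      20
# 1   Apple    Sell      45
# 2  Banana     Buy  30, 20
# 3  Banana    Sell      45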
I have been trying to get from the first table to the second and have run into difficulties completing the change.
Any help would be much appreciated.
I have also posted this in the Grafana community and am asking here as well in the hope of a wider reach and finding an answer.
We're currently serving 250 Billion ad impressions per day across our 6 data centers. Out of these, we are serving about 180 Billion ad impressions in the US alone.
Each ad impression can have hundreds of attributes (dimensions), e.g. Country, City, Browser, OS, custom parameters from the web page, ad-size, ad-id, site-id, etc.
Currently, we don't have a data warehouse, and ad-hoc OLAP support is pretty much non-existent in our organization. This severely limits our ability to run ad-hoc queries and get a quick grasp of the data.
We want to answer the following two queries to begin with:
Q1) Find the total count of ad impressions served from "beginDate" to "endDate" where Dimension1 = d1 AND Dimension2 = d2 AND ... AND DimensionK = dK
Q2) Find the total count of unique users who saw our ads from "beginDate" to "endDate" where Dimension1 = d1 AND/OR Dimension2 = d2 AND/OR ... AND/OR DimensionK = dK
As I said, each impression can have hundreds of dimensions (listed above), and the cardinality of each dimension could range from a few hundred (say, for the Country dimension) to billions (e.g. for User-id).
We want approximate answers, the least infrastructure cost, and a query response time under 5 minutes. I am thinking about using Druid and Apache DataSketches (Theta Sketch, to be precise) for answering Q2, using the following data model:
Date        Dimension Name  Dimension Value  Unique-User-ID (Theta sketch)
2021/09/12  "Country"       "US"             37873-3udif-83748-2973483
2021/09/12  "Browser"       "Chrome"         37873-3aeuf-83748-2973483
...
<Other records>
So after roll-up, I would end up with one theta sketch per dimension value per day (assuming day-level granularity), and I can do unions and intersections on these sketches to answer Q2.
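To sanity-check that plan, here is a minimal sketch using the Apache DataSketches Python bindings (the datasketches package on PyPI; the user IDs below are made up, and the constructor takes lg_k, the log-base-2 of the nominal entries):

from datasketches import update_theta_sketch, theta_union, theta_intersection

LG_K = 17  # nominal entries k = 2^17 = 131,072, close to the 10^5 I had in mind

# One sketch per (date, dimension-name, dimension-value), filled during roll-up.
us_users = update_theta_sketch(LG_K)
chrome_users = update_theta_sketch(LG_K)
for uid in ['user-1', 'user-2', 'user-3']:
    us_users.update(uid)
for uid in ['user-2', 'user-3', 'user-4']:
    chrome_users.update(uid)

# OR-semantics for Q2: union of the per-dimension sketches.
u = theta_union(LG_K)
u.update(us_users)
u.update(chrome_users)
print(u.get_result().get_estimate())  # ~4.0 unique users (US or Chrome)

# AND-semantics for Q2: intersection (this is where error grows fastest).
i = theta_intersection()
i.update(us_users)
i.update(chrome_users)
print(i.get_result().get_estimate())  # ~2.0 unique users (US and Chrome)

As a rough guide, the relative standard error of a single theta sketch is about 1/sqrt(k), so k = 2^17 would give roughly 0.3% on unions, though intersections of sketches with small overlap can be much less accurate.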
I am planning to set k (nominal entries) to 10^5 (please comment on what a suitable k would be for this use case and the expected storage required).
I've also read about theta sketch set-operation accuracy here.
I would like to know if there is a better approach to solving Q2 (with or without Druid).
I would also like to know how I can solve Q1.
If I replace Unique-User-Id with "Impression-Id", can I use the same data model to answer Q1? I believe that if I do, the accuracy of counting total impressions would be far worse than for Q2, because each ad impression is assigned a unique ID and we are currently serving 250 billion per day.
Please share your thoughts about solving Q1 and Q2.
Regards
kartik
I have a Union table of my various bank accounts to create a personal finance analysis dashboard.
I am trying to make a running total to show my total capital available at any given date. Using a Running Total table calculation works, as does a RUNNING_SUM() calculated field. They both work up until I filter the dates. So I am trying to find a way to make the running calculation work without being thrown off by date filters (I would like to implement relative dates for visualisation in the dashboard).
My union table has the following relevant data columns:
Order ID: Descending number from 1 for each entry per account.
Date: Date of entry.
Item: Entry name.
Account: Name of bank account.
Amount: positive for credit or negative for debit.
Balance: balance after entry value for each given account.
So the table can look like this:
So on 07/05/2019 the Running total should be 229.64.
The running sum formula mentioned above is currently RUNNING_SUM(SUM([Amount])), so if any dates are excluded via filter, the running total doesn't add up to the right amount.
A way around the problem could be to get the sum, across all accounts, of the last balance reading at a given date. The balance is already a running total, but it only works if the final entry per time period for each account is summed. Would it be possible to make a calculated field that gets the last balance reading for each account at any given date and then sums them?
Or is there a simpler, smarter way I am not aware of?
This comes down to an order-of-operations problem. Once you filter the dates, the viz doesn't have access to the data anymore.
Your best approach would be to add the running sum to the data source before you bring it into Tableau. Then the running sum isn't a calculated field dependent on the data in the viz.
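A minimal sketch of that precomputation in Python/pandas, assuming the column names you listed (the file names are just examples):

import pandas as pd

# Load the unioned bank-account table.
# dayfirst matches the 07/05/2019-style dates mentioned above.
df = pd.read_csv('accounts_union.csv', parse_dates=['Date'], dayfirst=True)

# Sort chronologically and compute the running total across all accounts.
df = df.sort_values(['Date', 'Order ID'])
df['Running Total'] = df['Amount'].cumsum()

# A date filter in Tableau now only hides rows; it no longer changes the
# precomputed running total carried by the rows that stay visible.
df.to_csv('accounts_union_with_running_total.csv', index=False)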
I'm a novice Tableau user, trying to help my organization to analyze phone traffic. My data source of incoming phone calls is in an Excel spreadsheet, and is listed like this:
TRANSACTION ID DATETIME
151313:179805 1/2/2018 9:57
151340:108017 1/2/2018 17:27
151395:176211 1/3/2018 15:27
Our total calls per day range from 10 to 50.
I'd like to count days with an identical # of calls, and probably make a histogram with # of calls on the X-axis and # of days with that many calls on the Y-axis.
I feel like this would be a simple Calculated Field, but for the life of me, I'm not getting what I'd do here.
Help! :)
One solution is to define an LOD calc, calls_per_day, as
{ FIXED DateTrunc('day', [DATETIME]) : COUNT("*") }
which, in effect, prebuilds a little table showing the number of data rows for each day. That works if you have one data row per transaction ID in your input.
If transaction IDs are repeated, and you instead want the number of distinct transactions for each day, you can use the following variation.
{ FIXED DateTrunc('day', [DATETIME]) : COUNTD([TRANSACTION ID]) }
COUNTD() can be expensive on large data sets, so it's better to use an alternative when you have the option.
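If you want to sanity-check the numbers outside Tableau, the same prep is a two-step group-by, sketched here in pandas (file and column names follow your sample):

import pandas as pd

df = pd.read_excel('phone_calls.xlsx', parse_dates=['DATETIME'])

# Step 1: distinct transactions per day (mirrors the FIXED ... COUNTD calc).
calls_per_day = df.groupby(df['DATETIME'].dt.date)['TRANSACTION ID'].nunique()

# Step 2: number of days for each calls-per-day value (the histogram input).
days_per_call_count = calls_per_day.value_counts().sort_index()
print(days_per_call_count)  # index = # of calls, value = # of days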
You can use an LOD:
{ FIXED [TRANSACTION ID] : COUNT(DAY([DATETIME])) }
Try this and post the result.
I'm creating a virtual stamp card program for the iPhone and have run into an issue with implementing my database. The program essentially has a main points system that can be utilized across all merchants (sort of like Air Miles), but I also want to keep track of how many times you've been to EACH merchant.
So far, I have created three main tables: users, merchants, and transactions.
1) Users table contains basic info like user_id and total points collected.
2) Merchants table contains info like merchant_id, location, total points given.
3) Transactions table simply gets a new row every time someone checks in at a merchant, recording a date-stamp, user name, merchant name, and points awarded.
So the most basic way to find out how many times you've been to each merchant is to query the entire transactions table by user and merchant. This gives me a transaction history of how many times you've been to that specific merchant (which is perfect), but in the long run, I feel this will be horrible for performance.
The other straightforward yet "dumb" method would be to create a column in the users table for EACH merchant and keep the running totals there. This seems inappropriate, as I will be adding new merchants on a regular basis, and new columns would need to be added to every user each time that happens.
I've looked into one-to-many and many-to-many relationships for MySQL databases, but can't seem to come up with anything concrete, as I'm extremely new to web/PHP/MySQL development, but I'm guessing this is what I'm looking for...
I've also thought of creating a separate transactions table for each user, with a column for the merchant and another for the # of times visited. Again, I'm not sure if this is the most efficient implementation.
Can someone point me in the right direction?
You're doing the right thing in the sense of thinking up the different options and weighing the good and bad of each.
Personally, I'd go with a MerchantCounter table which joins to your Merchant table on id_merchant (for example) and which you keep up to date explicitly.
Over time it does not get slower (unlike an activity search), and it does not take up much space.
Edit: based on your comment, Janan, no, I would use a single MerchantCounter table. So you've got your Merchant table:
id_merchant  nm_merchant
12           Jim
15           Tom
17           Wilbur
You would add a single additional table, MerchantCounter (edited to show how to tally totals for individual users):
id_merchant  id_user  num_visits
12           101      3
12           102      8
15           101      6007
17           102      88
17           104      19
17           105      1
You can see how id_merchant links the table to the Merchant table, and id_user links to a further User table.
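A minimal sketch of that design, using SQLite in Python for illustration (MySQL syntax differs slightly, e.g. INSERT ... ON DUPLICATE KEY UPDATE instead of ON CONFLICT):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
CREATE TABLE Merchant (
    id_merchant INTEGER PRIMARY KEY,
    nm_merchant TEXT NOT NULL
);
CREATE TABLE MerchantCounter (
    id_merchant INTEGER NOT NULL REFERENCES Merchant(id_merchant),
    id_user     INTEGER NOT NULL,
    num_visits  INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (id_merchant, id_user)
);
''')

def record_checkin(conn, id_merchant, id_user):
    # Bump the counter, creating the row on the user's first visit.
    conn.execute('''
        INSERT INTO MerchantCounter (id_merchant, id_user, num_visits)
        VALUES (?, ?, 1)
        ON CONFLICT (id_merchant, id_user)
        DO UPDATE SET num_visits = num_visits + 1
    ''', (id_merchant, id_user))

conn.execute("INSERT INTO Merchant VALUES (12, 'Jim')")
for _ in range(3):
    record_checkin(conn, 12, 101)   # user 101 checks in at Jim's three times
print(conn.execute(
    'SELECT num_visits FROM MerchantCounter WHERE id_merchant = 12 AND id_user = 101'
).fetchone())  # (3,)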