In Redshift, how can we treat similar strings as the same name so that we get the correct count? For example, the column school_name can contain the same school in different variants:
Spring Creek Elementary
Spring Creek Elementary School
Spring Creek
So count(distinct school_name) should return 1, not 3.
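The closest I have gotten is normalizing the names before counting, but that only handles suffix words I list explicitly (a sketch; the table name schools and the suffix pattern are placeholders):

-- Lower-case, strip the trailing words 'elementary' and/or 'school',
-- then count the distinct normalized names.
SELECT COUNT(DISTINCT TRIM(
         REGEXP_REPLACE(LOWER(school_name), '( elementary| school)+$', '')))
FROM schools;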
I would like to aggregate some columns into an array or JSON object in Redash.
The data table lives in a Presto database, and I need to query it from PySpark Hive.
The table is large, and I need to keep its size as small as possible so that I can save the dataframe to S3 quickly and then read it back from S3 as Parquet efficiently.
I am not sure what the best data structure for this is (a JSON object? an array of arrays?).
The original table (more than 10^9 rows; some columns, e.g. obj_desc, may contain more than 30 English words):
id  cat_name   cat_desc       obj_name  obj_desc     obj_num
1   furniture  living office  desk      4 corners    1.5
1   furniture  living office  chair     4 legs       0.8
1   furniture  restroom       tub       white wide   2.7
1   cloth      fashion        T-shirt   black large  1.1
What I want (this may not be the best data structure):
id  cat_item_aggregation
1   [['furniture', ['living office', ['desk', '4 corners', '1.5'], ['chair', '4 legs', '0.8']], ['restroom', ['tub', 'white wide', '2.7']]], ['cloth', ['fashion', ['T-shirt', 'black large', '1.1']]]]
I have tried array_agg, following PostgreSQL: Efficiently aggregate array columns as part of a group by and Postgres - aggregate two columns into one item,
and also json_build_object, following Return as array of JSON objects in SQL (Postgres) and How to group multiple columns into a single array or similar?,
but they do not work in Redash.
Could anybody let me know what the best data structure for this kind of table is, and how to build it?
Is JSON better than an array of arrays, given that it is hard to decompose the elements of an array of arrays?
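To show the shape I'm after, here is a Spark SQL sketch I could run from PySpark (the table name items is a placeholder, and I have not tested this at 10^9 rows):

-- Build objects per (id, cat_name, cat_desc), then nest descriptions
-- per (id, cat_name), then nest categories per id, and serialize to JSON.
SELECT id,
       to_json(collect_list(named_struct('cat_name', cat_name,
                                         'descs',    descs))) AS cat_item_aggregation
FROM (
    SELECT id, cat_name,
           collect_list(named_struct('cat_desc', cat_desc,
                                     'objs',     objs)) AS descs
    FROM (
        SELECT id, cat_name, cat_desc,
               collect_list(named_struct('obj_name', obj_name,
                                         'obj_desc', obj_desc,
                                         'obj_num',  obj_num)) AS objs
        FROM items
        GROUP BY id, cat_name, cat_desc
    ) per_desc
    GROUP BY id, cat_name
) per_cat
GROUP BY id;

A single JSON string per id like this compresses well in Parquet; keeping it as an array of structs instead (dropping the to_json) would preserve a queryable schema.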
Thanks.
I am writing a query to get records from Table A that satisfy a condition involving records in Table B. For example:
Table A is:
Name  Profession  City
John  Engineer    Palo Alto
Jack  Doctor      SF
Table B is:
Profession  City  NewJobOffer
Engineer    SF    Yes
and I'm interested in getting Table C:
Name  Profession  City  NewJobOffer
Jack  Engineer    SF    Yes
I can do this in two ways: using a WHERE clause to compare the columns and select the matching records, or using an explicit JOIN on the columns. Which one is faster in Spark SQL, and why?
It's better to put the filter in the WHERE clause. The two expressions are not equivalent.
When you provide the filtering in the JOIN clause, both data sources are retrieved in full and then joined on the specified condition. Since a join first shuffles the data (redistributes it between executors), you are going to shuffle a lot of data.
When you provide the filter in the WHERE clause, Spark can recognize it, and both data sources are filtered before being joined. This way you shuffle much less data. What may be even more important is that Spark can often also do filter pushdown, filtering the data at the data-source level, which means even less network pressure.
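To make the two shapes concrete (table and column names follow the question; the NewJobOffer = 'Yes' predicate is added here just for illustration, and you can prefix either statement with EXPLAIN to see where the Filter operator lands in the physical plan):

-- Everything expressed through the join condition:
SELECT a.Name, a.Profession, a.City, b.NewJobOffer
FROM A a
JOIN B b
  ON a.Profession = b.Profession AND a.City = b.City;

-- The same join plus a single-table predicate in WHERE; Spark can apply
-- (and often push down) this filter before shuffling that side:
SELECT a.Name, a.Profession, a.City, b.NewJobOffer
FROM A a
JOIN B b
  ON a.Profession = b.Profession AND a.City = b.City
WHERE b.NewJobOffer = 'Yes';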
Two stored procedures were developed by .NET developers. They returned the same record counts when passed the same parameter.
Now, after some changes, we are getting a record-count mismatch: where the first stored procedure returns 2 records for a parameter, the second returns only 1.
To find the cause, I verified:
1. the total record count of each table after joining
2. the total number of tables used in the joins
3. whether DISTINCT / GROUP BY is used in the two tables or not
But I am still not able to find the issue. How do I fix it?
Could anybody share some ideas? Thanks in advance.
Assuming the JOINs and filters are the same, the problem is NULLs.
That is, either:
- a WHERE clause has a direct NULL comparison (col = NULL), which always fails, or
- a COUNT is on a nullable column. See Count(*) vs Count(1) for more.
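A quick illustration, with t and its nullable column col as stand-ins:

-- A direct NULL comparison never matches; NULL checks need IS NULL.
SELECT COUNT(*) FROM t WHERE col =  NULL;   -- always returns 0
SELECT COUNT(*) FROM t WHERE col IS NULL;   -- returns the number of NULL rows

-- COUNT(col) skips NULLs, COUNT(*) counts every row.
SELECT COUNT(*) AS all_rows, COUNT(col) AS non_null_rows FROM t;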
Either way, why do you have two very similar stored procedures, written by two different developers, that appear to have differences?
Suppose I have a 1-to-N relationship, for example Student and College.
Student attributes: Name, Surname, CollegeFKey
College attributes: CollegeKey, Other, Other
Suppose that I have a program which reads students and exams from a plain text file, and that this file contains duplicated colleges and duplicated students, as in these denormalized rows:
CollegeId,Other,Other,Name,Surname,CollegeFKey
e.g.
1,x,y,Mike,M,1
1,x,y,R,P,1
...
As you can see, in this case I always have to check that I have not already inserted key 1 into the College table of my normalized DB.
How can I solve this in HBase or Cassandra? I mean, with thousands of tables and rows, I don't want to check, for every primary key and then for every FK, whether it was inserted correctly.
Can I use a NoSQL DB to work directly on denormalized data?
Can you link me to an example that solves this problem?
You can use Cassandra (http://wiki.apache.org/cassandra/) with a high-level language client (I use Hector for Java: https://github.com/rantav/hector). In Cassandra you describe a ColumnFamily College, and in this ColumnFamily you write Student columns which contain the information about the students.
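In today's CQL terms (rather than the Thrift-era ColumnFamily/Hector API above), the same denormalized idea might look like the sketch below; the table and column names are illustrative:

-- One row per (college, student); the college data is simply repeated.
CREATE TABLE students_by_college (
    college_id    int,
    college_other text,
    name          text,
    surname       text,
    PRIMARY KEY (college_id, name, surname)
);

-- Writes are upserts: inserting the same primary key twice just
-- overwrites the row, so no duplicate/FK check is needed up front.
INSERT INTO students_by_college (college_id, college_other, name, surname)
VALUES (1, 'x', 'Mike', 'M');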
I use MS SQL Server 2008 R2.
I've got a problem; please excuse the long explanation.
We've got an SSAS cube. It is under development at this time, but it is partially working and can be accessed through Excel.
There are projects: a hierarchical parent-child dimension.
There are resources assigned to the projects (e.g. man-hours, building materials, equipment): a dimension with resource types, plus an M2M fact table ProjectId-ResourceId-UnitsCount-Cost.
There are milestones for the projects: a dimension with milestone types (a few are defined), plus an M2M fact table ProjectId-MilestoneId-...milestone dates: planned/actual start/finish.
This is a simplified schema.
I need to create a MS Reporting Services report with the following columns:
Project hierarchy
several columns with pre-defined, "hardcoded" resource-type amounts, e.g. the business wants to see a column with man-hours spent and another with concrete consumption in cubic meters; these two clauses can be hardcoded in the query
several columns with pre-defined, "hardcoded" milestone-type dates
(this too is a simplified schema; more columns with other dimension slices are needed...)
The problem is that I cannot find an elegant way to create this report.
In my current version, I have to create two datasets and query the resource and milestone data in separate MDX queries, and then use the RS Lookup function to join the data in the report output.
Please advise:
Is it possible to query this data in a single MDX query? When I try something like
union({{[Dim Resource].[Measure].[man-hour]} + {[Dim Resource].[Measure].[cub-meter]}},
{[Dim Milestone].[Milestone Type].[ProjectStart]})
I get a "different dimensionality" error. Any workarounds?
If I need to output a formatted value like "X 'man-hour' / Y 'cub-meter'", I have to use the Lookup function to get both parts of the formula. Is there a better way?
Can I query this data in any other way?
Please point me in the right direction for googling.
Or should I just query the data from the source tables with SQL (this is allowed by the security restrictions), e.g. something like the sketch below?
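(All table and column names in this sketch are made up; per-project aggregation happens in subqueries so the two fact tables don't fan out against each other.)

SELECT p.ProjectId,
       r.ManHours,
       r.CubMeters,
       m.PlannedProjectStart
FROM Project p
LEFT JOIN (
    -- Pivot the hardcoded resource types into columns.
    SELECT pr.ProjectId,
           SUM(CASE WHEN rt.Name = 'man-hour'  THEN pr.UnitsCount END) AS ManHours,
           SUM(CASE WHEN rt.Name = 'cub-meter' THEN pr.UnitsCount END) AS CubMeters
    FROM ProjectResource pr
    JOIN ResourceType rt ON rt.ResourceId = pr.ResourceId
    GROUP BY pr.ProjectId
) r ON r.ProjectId = p.ProjectId
LEFT JOIN (
    -- Pivot the hardcoded milestone types into date columns.
    SELECT pm.ProjectId,
           MAX(CASE WHEN mt.Name = 'ProjectStart' THEN pm.PlannedStart END) AS PlannedProjectStart
    FROM ProjectMilestone pm
    JOIN MilestoneType mt ON mt.MilestoneId = pm.MilestoneId
    GROUP BY pm.ProjectId
) m ON m.ProjectId = p.ProjectId;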
Thank you in advance.
Perhaps create a new 'virtual cube' to contain the data from both of your existing cubes, and then query that one.