Where clause versus join clause in Spark SQL - scala

I am writing a query to get the records from Table A that satisfy a condition based on records in Table B. For example:
Table A is:
Name   Profession   City
John   Engineer     Palo Alto
Jack   Doctor       SF
Table B is:
Profession   City   NewJobOffer
Engineer     SF     Yes
and I'm interested in getting Table C:
Name   Profession   City   NewJobOffer
Jack   Engineer     SF     Yes
I can do this in two ways: using a WHERE clause or a join query. Which one is faster in Spark SQL, and why?
Is it better to use a WHERE clause to compare the columns and select those records, or to join on the column itself?

It's better to put the filter in the WHERE clause. The two forms are not equivalent.
When you put the filter in the JOIN clause, both data sources are retrieved in full and then joined on the specified condition. Because a join first shuffles the data (redistributes it between executors), you end up shuffling a lot of data.
When you put the filter in the WHERE clause, Spark can recognize it, so both data sources are filtered first and then joined. This way less data is shuffled. Perhaps even more importantly, Spark may also be able to do a filter pushdown, filtering data at the data source level, which means even less network pressure.
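
To make the difference concrete, here is a minimal sketch in plain Python (not Spark). It only counts how many rows are fed into the join in each case. The join on city and the city == "SF" filter are assumptions chosen for illustration, and the data is the toy data from the question.

```python
# Plain-Python sketch (not Spark): compare how many rows enter the join
# when the filter runs before the join versus after it.
table_a = [
    {"name": "John", "profession": "Engineer", "city": "Palo Alto"},
    {"name": "Jack", "profession": "Doctor", "city": "SF"},
]
table_b = [
    {"profession": "Engineer", "city": "SF", "new_job_offer": "Yes"},
]


def join_on_city(left, right):
    """Inner join on city; also report how many rows were fed into the join."""
    rows_in = len(left) + len(right)
    joined = [
        {**l, "new_job_offer": r["new_job_offer"]}
        for l in left
        for r in right
        if l["city"] == r["city"]
    ]
    return joined, rows_in


def in_sf(row):
    # Hypothetical filter condition used for this illustration.
    return row["city"] == "SF"


# Join everything first, then filter the joined rows.
joined_all, rows_late = join_on_city(table_a, table_b)
late = [row for row in joined_all if in_sf(row)]

# Filter each side first, then join only the surviving rows.
early, rows_early = join_on_city(
    [row for row in table_a if in_sf(row)],
    [row for row in table_b if in_sf(row)],
)

print(rows_late, rows_early)  # rows fed into the join in each case
print(late == early)          # both orders give the same result here
```

In a cluster, the rows fed into the join roughly correspond to the data that has to be shuffled, which is why filtering early matters.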

How to use OPTIMIZE ZORDER BY in Databricks

I have two dataframes, sd1 and sd2 (from a Delta Lake table), that I left join on an id column.
%sql
select
a.columnA,
b.columnB
from sd1 a
left outer join sd2 b
on a.id = b.id
The problem is that my query takes a long time. Looking for ways to improve it, I found a YouTube video about OPTIMIZE ZORDER BY.
According to the video, it seems to be useful for ordering columns that are going to be part of the where condition.
But since the two dataframes use the id in the join condition, would it be worth ordering by that column?
spark.sql(f'OPTIMIZE delta.`{sd1_delta_table_path}` ZORDER BY (id)')
My reasoning is that if we order that column first, it will take less time to look up the rows when making the match. Is this correct?
Thanks in advance.
OPTIMIZE ZORDER may help a bit by placing related data together, but its usefulness may depend on the data type used for the ID column. OPTIMIZE ZORDER relies on the data skipping functionality, which just gives you min & max statistics, and these may not be useful when you have big ranges in your joins.
You can also tune file sizes to avoid scanning too many small files.
But from my personal experience, for joins, bloom filters give better performance because they allow files to be skipped more efficiently than data skipping does. Just build a bloom filter on the ID column...
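
To illustrate why min/max statistics can stop helping when every file covers a wide ID range, while a Bloom filter can still rule files out, here is a toy sketch in plain Python. It is not Delta Lake or Databricks code: the file names, the IDs, and the BloomFilter class are all made up for the illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical files: each one holds IDs spread over almost the whole range.
files = {
    "part-0": [1, 500, 999],
    "part-1": [2, 501, 998],
    "part-2": [3, 502, 997],
}


def build_filter(ids):
    bloom = BloomFilter()
    for value in ids:
        bloom.add(value)
    return bloom


# One filter per file, built once (as an index would be at write time).
filters = {name: build_filter(ids) for name, ids in files.items()}


def minmax_candidates(key):
    """Files that min/max statistics cannot rule out."""
    return [name for name, ids in files.items() if min(ids) <= key <= max(ids)]


def bloom_candidates(key):
    """Files whose Bloom filter says the key might be present."""
    return [name for name, bloom in filters.items() if bloom.might_contain(key)]


# ID 250 is in none of the files, but it lies inside every min/max range,
# so min/max skipping has to keep all three files.
print(minmax_candidates(250))
# The Bloom filters will very likely rule out all of them
# (a false positive is possible but improbable at this size).
print(bloom_candidates(250))
```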

How to combine two separate aggregations together in same result in mongodb with join

I have two different aggregations. In the first one I simply retrieve data from a single MongoDB collection, and in the second one I retrieve data from 3 collections using joins (lookup, match) and groups.
Now I want to join the results of the first and second aggregations using MongoDB,
eg.
select a1.name , a2.count
from (select * from a where xyz group by name) as a1
LEFT OUTER JOIN
(select * from b(this contains multiple table joins)) as a2
on a1.id = a2.id
Welcome to Stack Overflow, Pravin.
As the MongoDB documentation states, MongoDB is most useful in cases where your data model is denormalized (i.e. related data is stored within a single document).
So your first option is to design your schema in a way that helps you avoid complex join queries.
You also have the option of making two separate queries to the database and using manual references, which work like foreign keys in an RDBMS.
Finally, you have the option of using so-called DBRefs.
Depending on your particular task, one of these options may be applicable.
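
As a rough sketch of the "two separate queries" option, the two result sets can be combined in application code. The snippet below is plain Python with no MongoDB driver calls: first_result and second_result are hypothetical stand-ins for the documents returned by the two aggregations, and the id, name, and count fields are assumed from the SQL in the question.

```python
# Hypothetical stand-ins for the documents returned by the two aggregations.
first_result = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
]
second_result = [
    {"id": 1, "count": 7},
]


def left_outer_join(left, right, key="id"):
    """Keep every document from `left`; attach `count` from `right` if present."""
    index = {doc[key]: doc for doc in right}  # look up right-hand docs by id
    merged = []
    for doc in left:
        match = index.get(doc[key])
        merged.append(
            {"name": doc["name"], "count": match["count"] if match else None}
        )
    return merged


combined = left_outer_join(first_result, second_result)
print(combined)
```

Documents with no match on the right side get a count of None, mirroring the NULL a LEFT OUTER JOIN would produce.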

Perl : Tracking duplicates

I am trying to figure out the best way to locate duplicates in 6-column CSV data. The real data has more than a million rows in it.
The following is the content of the 6 columns:
Name, address, city, post-code, phone number, machine number
The data does not have a fixed length, and data might be missing from certain columns in certain rows.
I am thinking of using Perl to first normalize all the short forms used in names, cities and addresses. Fellow Perl enthusiasts from Stack Overflow have helped me a lot.
But there would still be a lot of data which would be difficult to match.
So I am wondering whether it is possible to match content based on "likeliness / similarity" (e.g. google being similar to gugl). The likeliness is needed to overcome errors that crept in while the data was collected.
I have 2 tasks in hand w.r.t. the data:
1. Flag duplicate rows with a certain identifier.
2. Report the percentage match between similar rows.
I would really appreciate suggestions on which methods could be employed and which would probably be best on their merits.
You could write a Perl program to do this, but it will be easier and faster to put it into a SQL database and use that.
Most SQL databases have a way to import CSV. For this answer, I suggest PostgreSQL because it has very powerful string functions which you will need to find your fuzzy duplicates. Create your table with an auto incremented ID column if your CSV data doesn't already have unique IDs.
Once the import is done, add indexes on the columns you want to check for duplicates.
CREATE INDEX name ON whatever (name);
You can do a self-join to look for duplicates in whatever way you like. Here's an example that finds duplicate names.
SELECT t1.id, t2.id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name = t2.name
PostgreSQL has powerful string functions including regexes to do the comparisons.
Indexes will have a hard time working on things like lower(t1.name). Depending on the sorts of duplicates you want to work with, you can add indexes for these transforms (this is a feature of PostgreSQL). For example, if you want to search case-insensitively, you can add an index on the lower-cased name. (Thanks to @asjo for pointing that out.)
CREATE INDEX ON whatever ((lower(name)));
-- This will be muuuuuch faster
SELECT t1.id, t2.id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE lower(t1.name) = lower(t2.name)
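
If you want to try the self-join without setting up PostgreSQL, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here, and the three rows are made-up sample data; the table and column names follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE whatever (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO whatever (id, name) VALUES (?, ?)",
    [(1, "John Smith"), (2, "Jane Doe"), (3, "john smith")],  # sample rows
)
# Expression index on the lower-cased name (SQLite supports these too).
conn.execute("CREATE INDEX whatever_lower_name ON whatever (lower(name))")

# t1.id < t2.id reports each duplicate pair once and never pairs a row
# with itself.
pairs = conn.execute(
    """
    SELECT t1.id, t2.id
    FROM whatever t1
    JOIN whatever t2 ON t1.id < t2.id
    WHERE lower(t1.name) = lower(t2.name)
    ORDER BY t1.id, t2.id
    """
).fetchall()

print(pairs)  # [(1, 3)]
```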
A "likeness" match can be achieved in several ways. A simple one is to use the fuzzystrmatch functions, such as metaphone(). Same trick as before: add a column with the transformed value and index it.
Other simple things like data normalization are better done on the data itself before adding indexes and looking for duplicates. For example, trim out and squish extra whitespace.
UPDATE whatever SET name = trim(both from name);
UPDATE whatever SET name = regexp_replace(name, '[[:space:]]+', ' ', 'g');
Finally, you can use the Postgres Trigram module to add fuzzy indexing to your table (thanks again to @asjo).
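
To give a feel for what a trigram-based "percentage match" looks like, here is a minimal sketch in plain Python. It is written in the spirit of trigram matching, not as a reimplementation of the Postgres module: the padding scheme and the helper names trigrams() and similarity() are assumptions made for this example, and it treats each input as a single word.

```python
def trigrams(text):
    """Set of 3-character substrings of the lower-cased, space-padded text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def percent_match(a, b):
    return round(100 * similarity(a, b))


print(percent_match("google", "Google"))  # identical after lower-casing
print(percent_match("google", "gugl"))    # only a small overlap
```

Short, heavily misspelled words share few trigrams, so for cases like gugl a phonetic function such as metaphone() may be a better fit.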

How to retrieve a list of Columns from a single row in Cassandra?

The below is a sample of my Cassandra CF.
column1 column2 column3 ......
row1 : name:abay,value:10 name:benny,value:7 name:catherine,value:24 ................
ComparatorType:utf8
How can I fetch the columns with names ('abay', 'john', 'peter', 'allen') from this row in a single query using the Hector API?
The number of names in the list may vary every time.
I know that I can get them in sorted order using SliceQuery.
But there are cases when I need to fetch data randomly, as I mentioned above.
Kindly help me.
Based on your query, it seems you have two options.
If you only need to run this query occasionally, you can get all columns for the row and filter them on the client. If you have at most a few thousand columns, this should be ok for an occasional query.
If you need to run this frequently, you'll want to write the data such that you can query using name as the key. This probably means you'll have to write the data twice into two CFs, where one is by your current key, and the other is by name. This is a common Cassandra tactic.
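
The two options can be sketched with plain Python dictionaries standing in for column families. This is not Hector API code: the dictionaries, the helper names, and the value for "john" are hypothetical and only show the shape of each approach.

```python
# Columns of one row, as in the question; the "john" entry is hypothetical.
row1 = {"abay": 10, "benny": 7, "catherine": 24, "john": 3}
wanted = ["abay", "john", "peter", "allen"]


def filter_on_client(all_columns, names):
    """Option 1: fetch every column of the row, keep only the wanted names."""
    return {name: all_columns[name] for name in names if name in all_columns}


# Option 2: write every value twice, once keyed by row and once keyed by name.
by_row = {}   # stand-in for the existing CF
by_name = {}  # stand-in for a second CF keyed by name


def write(row_key, name, value):
    by_row.setdefault(row_key, {})[name] = value
    by_name.setdefault(name, {})[row_key] = value


for name, value in row1.items():
    write("row1", name, value)


def fetch_by_name(row_key, names):
    """Look each name up directly instead of scanning the whole row."""
    return {
        name: by_name[name][row_key]
        for name in names
        if row_key in by_name.get(name, {})
    }


print(filter_on_client(row1, wanted))
print(fetch_by_name("row1", wanted))
```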

Sorting data in a oracle table

Is it possible to sort the data inside an Oracle table, e.g. ascending/descending by a certain column, or alphabetically? I'm using Oracle 10g Express.
You could try
Select *
from some_table
order by some_column asc
This will sort the results by some_column and place them in ascending order. Use desc instead of asc if you want descending order. Or did you mean to have the ordering in the physical storage itself?
I believe it's possible to specify the ordering/sorting of an indexed column in storage. It's probably closest to what you want. I don't usually use this index sort feature, but for more info see: http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10759/statements_5010.htm#i2062718
Perhaps you could use an index-organized table (IOT) to ensure that the data is stored ordered by index.
Have a look at the physical properties clause of the CREATE TABLE statement:
http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_7002.htm#i2128663
What is the problem that you are trying to solve though? An IOT may or may not be what you should be using.
As defined by the relational model, the rows and columns in a table are unordered. That's the theory at least.
In practice, if you want the data to come out of a query in a particular order then you should always use the ORDER BY clause. The order of the output is not guaranteed unless you provide that.
It would be possible to use an ORDER BY when inserting into a table but that doesn't guarantee the order that data will be output. A query might come out in the same order every time.... but that doesn't mean it will come out in the same order next time.
There were issues when Oracle 10g came out where aggregate queries (with GROUP BY) were not coming out sorted because users had come to rely on the data being sorted as a side-effect of the grouping. With the introduction of the HASH GROUP BY in addition to the SORT GROUP BY people were caught out. This was a useful reminder that ORDER BY should always be used.
What do you really mean?
Are you just asking for the ORDER BY clause?
http://www.1keydata.com/sql/sqlorderby.html
Oracle 10g Express supports ANSI SQL like most RDBMSs, so you can sort in the standard manner:
SELECT * FROM Persons ORDER BY LastName
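
For a quick way to see ORDER BY in action, here is a small sketch using Python's built-in sqlite3 module. SQLite stands in for Oracle only because it needs no setup, and the rows in Persons are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Persons (LastName TEXT)")
conn.executemany(
    "INSERT INTO Persons (LastName) VALUES (?)",
    [("Smith",), ("Adams",), ("Moore",)],  # inserted in no particular order
)

# The order is requested at query time; the table itself stays unordered.
ascending = [row[0] for row in conn.execute(
    "SELECT * FROM Persons ORDER BY LastName ASC")]
descending = [row[0] for row in conn.execute(
    "SELECT * FROM Persons ORDER BY LastName DESC")]

print(ascending)   # ['Adams', 'Moore', 'Smith']
print(descending)  # ['Smith', 'Moore', 'Adams']
```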
A good tutorial on SQL can be found here: w3schools SQL
Oracle Express does have some limitations compared to the Enterprise Edition but not in the basic SQL dialect it supports.