I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]).
I am using Spark 1.4.1, set up on a very powerful machine (2 CPUs, 24 cores, 126 GB RAM).
I have tried several memory setups and tuning options to make it run faster, but none of them made a huge impact.
I am sure I am missing something. Below is my final attempt, which took about 11 minutes to get this simple count, versus only 40 seconds using a JDBC connection through R.
bin/pyspark --driver-memory 40g --executor-memory 40g
df = sqlContext.read.jdbc("jdbc:teradata://......)
df.count()
When I tried with a big table (5B records), no results were returned upon completion of the query.
All of the aggregation operations are performed after the whole dataset is retrieved into memory as a DataFrame. So doing the count in Spark will never be as efficient as doing it directly in Teradata. Sometimes it's worth pushing some computation into the database by creating views and then mapping those views using the JDBC API.
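For example, here is a minimal sketch of that idea (placeholders, not the asker's actual URL or table): pass a subquery, or a view name, as the table argument, so the database computes the count and Spark only fetches a single row.
val counts = sqlctx.read.jdbc(
  "<URL>",
  "(SELECT count(*) AS cnt FROM <TABLE>) AS t", // the DB does the aggregation
  new java.util.Properties()
)
counts.show() // one row comes back over JDBC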
Every time you use the JDBC driver to access a large table you should specify the partitioning strategy; otherwise you will create a DataFrame/RDD with a single partition and you will overload the single JDBC connection.
Instead you want to try the following API (available since Spark 1.4.0):
sqlctx.read.jdbc(
  url = "<URL>",
  table = "<TABLE>",
  columnName = "<INTEGRAL_COLUMN_TO_PARTITION>",
  lowerBound = minValue,
  upperBound = maxValue,
  numPartitions = 20,
  connectionProperties = new java.util.Properties()
)
There is also an option to push down some filtering.
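For instance, a minimal sketch (the ACCT_STATUS column is made up; everything else mirrors the call above) of a filter that Spark can push down into the queries it generates:
val active = sqlctx.read.jdbc(
  url = "<URL>",
  table = "<TABLE>",
  columnName = "<INTEGRAL_COLUMN_TO_PARTITION>",
  lowerBound = minValue,
  upperBound = maxValue,
  numPartitions = 20,
  connectionProperties = new java.util.Properties()
).filter("ACCT_STATUS = 'A'") // simple comparisons become part of each partition's WHERE clause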
If you don't have a uniformly distributed integral column, you want to create custom partitions by specifying custom predicates (WHERE clauses). For example, let's suppose you have a timestamp column and want to partition by date ranges:
val predicates =
  Array(
    "2015-06-20" -> "2015-06-30",
    "2015-07-01" -> "2015-07-10",
    "2015-07-11" -> "2015-07-20",
    "2015-07-21" -> "2015-07-31"
  ).map { case (start, end) =>
    s"cast(DAT_TME as date) >= date '$start' AND cast(DAT_TME as date) <= date '$end'"
  }
predicates.foreach(println)
// Below is the result of how predicates were formed
//cast(DAT_TME as date) >= date '2015-06-20' AND cast(DAT_TME as date) <= date '2015-06-30'
//cast(DAT_TME as date) >= date '2015-07-01' AND cast(DAT_TME as date) <= date '2015-07-10'
//cast(DAT_TME as date) >= date '2015-07-11' AND cast(DAT_TME as date) <= date '2015-07-20'
//cast(DAT_TME as date) >= date '2015-07-21' AND cast(DAT_TME as date) <= date '2015-07-31'
sqlctx.read.jdbc(
  url = "<URL>",
  table = "<TABLE>",
  predicates = predicates,
  connectionProperties = new java.util.Properties()
)
It will generate a DataFrame where each partition contains the records of the subquery associated with its predicate.
Check the source code at DataFrameReader.scala
Does the deserialized table fit into 40 GB? If it starts swapping to disk, performance will decrease dramatically.
Anyway, when you use standard JDBC with ANSI SQL syntax you leverage the DB engine, so if Teradata (I don't know Teradata) holds statistics about your table, a classic "select count(*) from table" will be very fast.
Spark, instead, loads your 100 million rows into memory with something like "select * from table" and then performs a count on the RDD rows. It's a pretty different workload.
One solution that differs from the others is to save the data from the database table into Avro files (split across many files) stored on Hadoop.
This way, reading those Avro files with Spark is a piece of cake, since you won't call the DB anymore.
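A sketch of that approach, assuming df is the DataFrame read over JDBC and the com.databricks:spark-avro package for Spark 1.x (the HDFS path is a placeholder):
// one-off export to Avro on Hadoop; Spark writes it split across many part files
df.write.format("com.databricks.spark.avro").save("hdfs:///data/my_table_avro")
// later reads never touch the database
val offline = sqlContext.read.format("com.databricks.spark.avro").load("hdfs:///data/my_table_avro")
offline.count()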
Related
I have saved the data in the warehouse in Parquet file format, partitioned by a date-type column.
I am trying to get the last N days of data from the current date using Scala Spark.
The data is saved under the warehouse path like below:
Tespath/filename/dt=2020-02-01
Tespath/filename/dt=2020-02-02
...........
Tespath/filename/dt=2020-02-28
If I read all the data, it is a very huge amount of data.
As your dataset is correctly partitioned using the parquet format, you just need to read the directory Testpath/filename and let Spark do the partition discovery.
It will add a dt column to your schema with the value taken from the path name (dt=<value>). This value can be used to filter your dataset, and Spark will optimize the read by partition pruning: every directory which does not match your predicate on the dt column is skipped.
You could try something like this:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = spark.read.parquet("Testpath/filename/")
  .where($"dt" > date_sub(current_date(), N))
You need to ensure that spark.sql.parquet.filterPushdown is set to true (which is the default).
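A quick way to double-check (a sketch): print the physical plan and verify that the read lists only the matching dt=... partitions instead of a full scan.
spark.conf.set("spark.sql.parquet.filterPushdown", "true") // the default anyway
df.explain(true) // look for a partition filter on dt in the printed plan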
I have partitioned data on S3 that I would like to access via Spectrum. The current file structure is similar to: s3://bucket/dir/year=2018/month=11/day=19/hour=12/file.parquet
I partitioned the data using Glue, by parsing a field I use for timestamps, ts. Most queries I will run will be on the ts field, as they are timestamp range queries that are more granular than daily (they may span multiple days, or less than one day, but time is often involved).
How would I go about creating hourly (preferred; daily would work if needed) partitions on my data, so that when I query the ts (or another timestamp) field, it will access the partitions correctly? If needed I can recreate my data with different partitions. Most examples/docs just bucket data daily and use the date field in the query.
I would be happy to provide more information if needed.
Thank you!
Example query would be something like:
SELECT * FROM spectrum.data
WHERE ts between '2018-11-19 17:30:00' AND '2018-11-20 04:45:00'
Spectrum is not so intuitive. You probably will need to convert the timestamp to year, month, day, ...
and then do something like WHERE (year > x AND year < y) AND (month > x1 AND month < x2) AND ...
That looks ugly.
You can consider doing something else:
s3://bucket/dir/date=2018-11-19/time=17:30:00/file.parquet
In that case your query will be simpler:
WHERE ( date < '2018-11-19' AND date > '2018-11-17') AND ( time < '17:30:00' AND time > '17:20:00')
or using BETWEEN:
https://docs.aws.amazon.com/redshift/latest/dg/r_range_condition.html
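If you can rewrite the data, here is a sketch of producing such a layout with Spark (column names and the S3 path are assumptions; it partitions by date and hour rather than the date/time layout above):
import org.apache.spark.sql.functions.{col, hour, to_date}

df.withColumn("date", to_date(col("ts")))
  .withColumn("hour", hour(col("ts")))
  .write
  .partitionBy("date", "hour") // directory names become the partition columns Spectrum prunes on
  .parquet("s3://bucket/dir/")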
If the partitions are created as shown below, they will cater to the query asked by @Eumcoz:
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-19 17:30:00')
LOCATION 's3path/ts=2018-11-19 17:30:00/';
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-19 17:40:00')
LOCATION 's3path/ts=2018-11-19 17:40:00/';
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-19 17:50:00')
LOCATION 's3path/ts=2018-11-19 17:50:00/';
ALTER TABLE spectrum.data ADD PARTITION (ts='2018-11-20 07:30:00')
LOCATION 's3path/ts=2018-11-20 07:30:00/';
Then if you fire this query, it will return the data in all the above partitions:
select * from spectrum.data where ts between '2018-11-19 17:30:00' and '2018-11-20 07:50:00'
P.S. Please up-vote this if it solves your problem. (I need 50 reputation to be able to comment on posts :) )
We are using the Avro data format, and the data is partitioned by year, month, day, hour and min.
I see the data stored in HDFS as
/data/year=2018/month=01/day=01/hour=01/min=00/events.avro
And we load the data using
val schema = new Schema.Parser().parse(this.getClass.getResourceAsStream("/schema.txt"))
val df = spark.read.format("com.databricks.spark.avro").option("avroSchema",schema.toString).load("/data")
And then we use predicate pushdown for filtering the data:
var x = isInRange(startDate, endDate)($"year", $"month", $"day", $"hour", $"min")
df = tableDf.filter(x)
Can someone explain what is happening behind the scenes?
I want to specifically understand when and where the filtering of input files happens.
Interestingly, when I print the schema, the fields year, month, day and hour are automatically added, i.e. the actual data does not contain these columns. Does Avro add these fields?
I want to understand clearly how the files are filtered and how the partitions are created.
Good day,
I wish to merge two datasets, matching each date to the closest preceding date.
The datasets are huge (500 MB to 1 GB), so PROC SQL is out of the question.
I have two datasets. The first (Fleet) has observations; the second has a date and the generation number to use for further processing. Like this:
data Fleet
CreatedPortalDate
2013/2/19
2013/8/22
2013/8/25
2013/10/01
2013/10/07
data gennum_list
date
01/12/2014
08/12/2014
15/12/2014
22/12/2014
29/12/2014
...
What I'd like to have is a link-table like this:
data link_table
CreatedPortalDate date
14-12-03 01/12/2014
14-12-06 01/12/2014
14-12-09 08/12/2014
14-12-11 08/12/2014
14-12-14 08/12/2014
With the logic that
Date < CreatedPortalDate and (CreatedPortalDate - date) = min(CreatedPortalDate - date)
What I came up with is a bit clunky and I'm looking for an efficient/better way to accomplish this.
data all_comb;
  set devFleet(keep=createdportaldate);
  do i=1 to n;
    set gennum_list(keep=date) point=i nobs=n;
    if createdportaldate > date
       and createdportaldate - 15 < date then do; /* Assumption: the generations are created weekly. */
      distance = createdportaldate - date;
      output;
    end;
  end;
run;
proc sort data=all_comb; by createdportaldate distance; run;
data link_table;
  set all_comb(drop=distance);
  by createdportaldate;
  if first.createdportaldate;
run;
Any ideas how to improve or approach this issue?
An ignorant idea: could I create a hash table where the distance would be stored?
Or arrays, maybe? Somehow.
EDIT (answers to comments):
Common format? Done.
Where do the billion rows come from? Yes, there are other data involved, but the date is the only linking variable.
Sorted? Yes, the data is sorted and can be sorted again.
Are gennum dates always seven days apart? No. That's the tricky part. Otherwise I could use week and year (or other binning) as a unique identifier.
Huge is a relative term, today's huge is tomorrow's speck.
Key data features indicate a direct addressing lookup scheme is possible:
Date values are integers.
Date value ranges are limited.
A date value, or any of the next 14 days, will be used as a lookup verifier.
The key is a date value, which can be used as an array index.
Load the Gennum lookup once as follows
array gennum_of ( %sysfunc(today()) ) _temporary_;
if last_date then
do index = last_date to date-1;
gennum_of(index) = prev_date;
end;
last_date = date;
And fetch a gennum as
if portaldate > last_date
  then portal_gennum = last_date;
  else portal_gennum = gennum_of( portaldate );
If you have many rows due to grouping by account ids, you will have to clear and reload the gennum array per group.
This is a typical application of a SAS BY statement.
The BY statement in a data step is meant to read two or more data sets at once, sorted by a common variable.
The common variable is the date, but it is named differently in the two datasets. In SQL, you would solve that by requiring equality of the one variable to the other (Fleet.CreatedPortalDate = gennum_list.date), but the BY statement does not allow such a construction, so we have to rename (at least) one of them while reading the datasets. That is what we do in the rename clause within the options of gennum_list.
data all_comb;
  merge gennum_list (in = in_gennum rename = (date = CreatedPortalDate))
        Fleet (in = in_fleet);
  by CreatedPortalDate;
I chose to combine the by statement with a merge statement, though a set would have done the job too; in that case, however, the order of both input datasets makes a difference.
Also note that I requested SAS to create the indicator variables in_gennum and in_fleet, which show in which input dataset a value was present. It is handy to know that this type of variable is not written to the result data set.
However, we have to recover the date from the CreatedPortalDate, of course:
if in_gennum then date = CreatedPortalDate;
If you are new to SAS, you will be surprised that the above statement does not work unless you explicitly instruct SAS to retain the value of date from one observation to the next. (Observation is SAS jargon for row.)
retain date;
And here we write out one observation for each observation read from the Fleet dataset:
if in_fleet then output;
run;
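Putting the statements together, the complete step would read as follows (assembled from the pieces above, and as untested as the rest of this answer):
data link_table;
  merge gennum_list (in = in_gennum rename = (date = CreatedPortalDate))
        Fleet (in = in_fleet);
  by CreatedPortalDate;
  retain date;
  if in_gennum then date = CreatedPortalDate; /* remember the latest gennum date */
  if in_fleet then output;                    /* each Fleet row gets the preceding gennum date */
run;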
The advantages of this approach are:
you need much less logic to correctly combine the observations from both input datasets (and that is what the data step was invented for)
you never have to keep an array of values in memory, so you cannot have overflow problems
this solution is linear, O(n), in the size of the datasets (apart from the sorting), so we know upfront that doubling the amount of data will only double the time.
Disclaimer: this answer is under construction.
It will be tested later this week
I have a table with a created (timestamptz) column. Now I need to create pagination based on the timestamp, because while a user is viewing the first page, new items could be submitted into this table, which would make the data inconsistent if I used OFFSET for pagination.
So the question is: should I keep the created type as timestamptz, or is it better to convert it into an integer (unix time, e.g. 1472031802812)? If so, are there any disadvantages? Also, at the moment I have now() as the default value of created - is there an alternative function to create a unix timestamp?
Let me rewrite things from the comments into my answer. You want to use the timestamp type instead of integer simply because that is exactly what it was designed for. Doing manual conversions between timestamp integers and timestamp objects is just a pain, and you gain nothing. You will also eventually need the timestamp type for more complex datetime-based queries.
To answer the question about pagination: you simply run
SELECT *
FROM table_name
WHERE created < lastTimestamp
ORDER BY created DESC
LIMIT 30
If it is the first query, then you set, say, lastTimestamp = '3000-01-01'. Otherwise you set lastTimestamp = last_query.last_row.created.
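One practical note (my addition, assuming the obvious schema): this keyset pagination stays fast only if created is indexed, e.g.
-- index matching the ORDER BY created DESC scan
CREATE INDEX table_name_created_idx ON table_name (created DESC);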
Optimization
Note that if the table is big, then ORDER BY created DESC might not be efficient (especially if called in parallel with different ranges). In this case you can use a moving "time window", for example:
SELECT *
FROM table_name
WHERE
created < lastTimestamp
AND created >= lastTimestamp - interval '1 day'
The 1 day interval is picked arbitrarily (tune it to your needs). You can also sort the results in the app.
If the result is not empty, then you update (in your app)
lastTimestamp = last_query.last_row.created
(assuming you've done the sorting; otherwise you take min(last_query.row.created))
If the result is empty, then you repeat the query with lastTimestamp = lastTimestamp - interval '1 day' until you fetch something. You also have to stop if lastTimestamp becomes too low, i.e. when it is lower than any other timestamp in the table (which has to be prefetched).
All of this is under some assumptions about inserts:
1. new_row.created >= any_row.created
2. new_row.created ~ current_time
3. The distribution of new_row.created is more or less uniform
Assumption 1 ensures that pagination yields consistent data, while assumption 2 is only needed for the default 3000-01-01 date. Assumption 3 makes sure you don't have big empty gaps that force you to issue many empty queries.
You mean something like this?
select extract(epoch from now())::integer as unix_time