Need advice on efficiently inserting millions of time series data into a Cassandra DB - scala

I want to use a Cassandra database to store time series data from a test site. I am using Pattern 2 from the "Getting started with Time Series Data Modeling" tutorial, but instead of storing the date (used to limit row size) as a date, I store it as an int counting the days elapsed since 1970-01-01; the timestamp of each value is the number of nanoseconds since the epoch (some of our measuring devices are that precise, and the precision is needed). My table for the values looks like this:
CREATE TABLE values (
  channel_id INT,
  day INT,
  time BIGINT,
  value DOUBLE,
  PRIMARY KEY ((channel_id, day), time)
);
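For reference, the day bucket and the nanosecond timestamp can both be derived from a single epoch-nanosecond reading; a minimal plain-Python sketch of the arithmetic (the function name is mine):

```python
from datetime import datetime, timezone

NS_PER_DAY = 86_400 * 1_000_000_000

def partition_fields(epoch_ns: int) -> tuple[int, int]:
    """Split an epoch-nanosecond reading into the (day, time) columns:
    day = whole days since 1970-01-01, time = full nanosecond timestamp."""
    return epoch_ns // NS_PER_DAY, epoch_ns

# 2005-10-26 00:00:00 UTC, plus 1 ns of sub-second precision
ts = int(datetime(2005, 10, 26, tzinfo=timezone.utc).timestamp()) * 1_000_000_000 + 1
day, time_ns = partition_fields(ts)
# day is 13082 (days since the epoch); time_ns keeps full nanosecond precision
```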
I created a simple benchmark, using asynchronous execution and prepared statements for bulk loading instead of batches:
def valueBenchmark(numVals: Int): Unit = {
  val vs = session.prepare(
    "insert into values (channel_id, day, time, value) values (?, ?, ?, ?)")
  val currentFutures = mutable.MutableList[ResultSetFuture]()
  for (i <- 0 until numVals) {
    currentFutures += session.executeAsync(
      vs.bind(-1: JInt, i / 100000: JInt, i.toLong: JLong, 0.0: JDouble))
    if (currentFutures.length >= 10000) {
      currentFutures.foreach(_.getUninterruptibly)
      currentFutures.clear()
    }
  }
  if (currentFutures.nonEmpty) {
    currentFutures.foreach(_.getUninterruptibly)
  }
}
JInt, JLong and JDouble are simply java.lang.Integer, java.lang.Long and java.lang.Double, respectively.
When I run this benchmark for 10 million values, it takes about two minutes against a locally installed single-node Cassandra. My computer is equipped with 16 GiB of RAM and a quad-core i7 CPU. I find this quite slow. Is this normal performance for inserts with Cassandra?
I already read these:
Anti-Patterns in Cassandra
Another question on write performance
Are there any other things I could check?

Simple maths:
10 million inserts / 2 minutes ≈ 83,333 inserts/sec, which is great for a single machine. Did you expect something faster?
By the way, what are the specs of your hard drives? SSD or spinning disks?
You should know that massive insert scenarios are more CPU-bound than I/O-bound. Try executing the same test on a machine with 8 physical cores (16 vcores with Hyper-Threading) and compare the results.


Pyspark count() is taking a long time before and after using the subtract command

This is my code:
spark_df1 = spark.read.option('header','True').csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")
spark_df1.count()  # This command took around 1.40 min for execution
spark_df1 = spark.read.option('header','True').csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")
test_data = spark_df1.sample(fraction=0.001)
spark_df2 = spark_df1.subtract(test_data)
spark_df2.count()  # This command takes more than 20 min to execute. Can anyone explain
# why it takes so long for the same count command?
Why is count() taking a long time after using the subtract command?
The gist is that subtract is an expensive operation involving joins and distinct, both of which incur shuffles, hence it takes much longer than a plain spark_df1.count(). How much longer depends on the Spark executor configuration and the partitioning scheme. Do update the question as requested in the comments for an in-depth analysis.
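For intuition: DataFrame.subtract has EXCEPT DISTINCT semantics, so both sides must be de-duplicated and anti-joined (each a shuffle in Spark), while count is a single scan. A rough plain-Python analogy (not Spark code, data is made up):

```python
rows = [("a", 1), ("b", 2), ("a", 1), ("c", 3)]
sample = [("b", 2)]

# count(): one pass over the data, no shuffle needed
n = len(rows)

# subtract(): semantically EXCEPT DISTINCT -- deduplicate both sides,
# then remove every row that also appears in the sample
remaining = set(rows) - set(sample)

# counting *after* subtract forces all that extra work to happen first
n_after = len(remaining)
```

Note that the duplicate ("a", 1) collapses to one row, so the result has 2 rows, not 3.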

PostgreSQL: smaller timestamptz type?

timestamptz is 8 bytes in PostgreSQL. Is there a way to get a 6-byte timestamptz by dropping some precision?
6 bytes is pretty much out of the question, since there is no data type with that size.
With some contortions you could use a 4-byte real value:
CREATE CAST (timestamp AS bigint) WITHOUT FUNCTION;
SELECT (localtimestamp::bigint / 1000000 - 662774400)::real;
float4
--------------
2.695969e+06
(1 row)
That would give you the time since 2021-01-01 00:00:00 with a precision of about a second (but of course for dates farther from that point, the precision will deteriorate).
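To see why the precision is about a second near that offset and degrades further out: a real (float4) has a 24-bit significand, so integers up to 2^24 ≈ 16.7 million are represented exactly, and beyond that, gaps open up. A quick illustration round-tripping values through 32-bit floats with Python's struct module:

```python
import struct

def to_float32(x: float) -> float:
    """Round-trip a Python float through an IEEE 754 binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# ~31 days of seconds since 2021-01-01: still below 2**24, stored exactly
assert to_float32(2_695_969.0) == 2_695_969.0

# ~2 years out (~63 million seconds): adjacent seconds collapse together
far = 63_000_000.0
assert to_float32(far) == to_float32(far + 1.0)
```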
But the whole exercise is pretty much pointless. Trying to save 2 or 4 bytes in this way is not a good idea:
- the space savings will be minimal; today, when you can have terabytes of storage with little effort, that seems pointless
- if you don't carefully arrange your table columns, you will lose the bytes you think you have won to alignment padding
- using a number instead of a proper timestamp data type will make your queries more complicated and the results harder to interpret, and it will keep you from using date arithmetic
For all these reasons, I would place this idea firmly in the realm of harmful micro-optimization.

Postgresql max TransactionId > 4 billion

The max transaction ID of PostgreSQL should be 2^32, which is about 4 billion; however, when I query the current transaction ID from the DB via select cast(txid_current() as text), I get a number around 8 billion. Why does this happen? autovacuum_freeze_max_age is 200 million.
As the documentation for the function family you are using says:
The internal transaction ID type (xid) is 32 bits wide and wraps around every 4 billion transactions. However, these functions export a 64-bit format that is extended with an "epoch" counter so it will not wrap around during the life of an installation.
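So a value around 8 billion simply means the cluster is in its second xid epoch; the low 32 bits are the actual xid. A quick decomposition (plain Python, the 8-billion value is illustrative):

```python
XID_WIDTH = 32

def decode_txid(txid64: int) -> tuple[int, int]:
    """Split the 64-bit value returned by txid_current() into
    (epoch, 32-bit xid), per the encoding described in the docs."""
    return txid64 >> XID_WIDTH, txid64 & ((1 << XID_WIDTH) - 1)

epoch, xid = decode_txid(8_000_000_000)
# epoch == 1: the 32-bit counter has wrapped once; xid == 3_705_032_704
```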

Spark jar running for too long [duplicate]

I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]).
I am using Spark 1.4.1, set up on a very powerful machine (2 CPUs, 24 cores, 126 GB RAM).
I have tried several memory setups and tuning options to make it work faster, but none of them made a huge impact.
I am sure there is something I am missing. Below is my final try, which took about 11 minutes to get this simple count, versus only 40 seconds using a JDBC connection through R.
bin/pyspark --driver-memory 40g --executor-memory 40g
df = sqlContext.read.jdbc("jdbc:teradata://......)
df.count()
When I tried with a big table (5B records), no results were returned upon completion of the query.
All of the aggregation operations are performed after the whole dataset is retrieved into memory as a DataFrame. So doing the count in Spark will never be as efficient as doing it directly in Teradata. Sometimes it's worth pushing some computation into the database by creating views and then mapping those views using the JDBC API.
Every time you use the JDBC driver to access a large table you should specify the partitioning strategy otherwise you will create a DataFrame/RDD with a single partition and you will overload the single JDBC connection.
Instead, you want to try the following API (available since Spark 1.4.0):
sqlctx.read.jdbc(
  url = "<URL>",
  table = "<TABLE>",
  columnName = "<INTEGRAL_COLUMN_TO_PARTITION>",
  lowerBound = minValue,
  upperBound = maxValue,
  numPartitions = 20,
  connectionProperties = new java.util.Properties()
)
There is also an option to push down some filtering.
If you don't have a uniformly distributed integral column, you want to create custom partitions by specifying custom predicates (WHERE clauses). For example, let's suppose you have a timestamp column and want to partition by date ranges:
val predicates =
  Array(
    "2015-06-20" -> "2015-06-30",
    "2015-07-01" -> "2015-07-10",
    "2015-07-11" -> "2015-07-20",
    "2015-07-21" -> "2015-07-31"
  ).map { case (start, end) =>
    s"cast(DAT_TME as date) >= date '$start' AND cast(DAT_TME as date) <= date '$end'"
  }

predicates.foreach(println)
// Below is the result of how the predicates were formed:
// cast(DAT_TME as date) >= date '2015-06-20' AND cast(DAT_TME as date) <= date '2015-06-30'
// cast(DAT_TME as date) >= date '2015-07-01' AND cast(DAT_TME as date) <= date '2015-07-10'
// cast(DAT_TME as date) >= date '2015-07-11' AND cast(DAT_TME as date) <= date '2015-07-20'
// cast(DAT_TME as date) >= date '2015-07-21' AND cast(DAT_TME as date) <= date '2015-07-31'
sqlctx.read.jdbc(
  url = "<URL>",
  table = "<TABLE>",
  predicates = predicates,
  connectionProperties = new java.util.Properties()
)
It will generate a DataFrame where each partition contains the records of the subquery associated with its predicate.
Check the source code at DataFrameReader.scala
Does the unserialized table fit into 40 GB? If it starts swapping to disk, performance will decrease dramatically.
Anyway, when you use standard JDBC with ANSI SQL syntax, you leverage the DB engine, so if Teradata (which I don't know) holds statistics about your table, a classic "select count(*) from table" will be very fast.
Spark, instead, loads your 100 million rows into memory with something like "select * from table" and then counts the RDD rows. That's a pretty different workload.
One solution that differs from the others is to save the data from the database table into Avro files (partitioned into many files) stored on Hadoop.
This way, reading those Avro files with Spark would be a piece of cake, since you won't hit the DB anymore.
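The difference is easy to demonstrate with any SQL engine. Here a SQLite stand-in (not Teradata) contrasts pushing count(*) into the engine with pulling every row out and counting client-side, which is roughly what Spark's naive JDBC read does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100_000)])

# Pushed down: the engine does the counting, a single number
# crosses the connection
(fast_count,) = conn.execute("SELECT count(*) FROM t").fetchone()

# Client-side: every row is transferred, then counted -- same answer,
# vastly more data movement on a 100M-row table
slow_count = len(conn.execute("SELECT * FROM t").fetchall())

assert fast_count == slow_count == 100_000
```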

Data partitioning in s3

We have our data in a relational database in a single table with columns id and date, like this:
productid  date        value1  value2
1          2005-10-26  24      27
1          2005-10-27  22      28
2          2005-10-26  12      18
We are trying to load the data to S3 as Parquet and create metadata in Hive to query it using Athena and Redshift. Our most frequent queries will filter on product id, day, month, and year, so we are trying to partition the data in a way that gives better query performance.
From what I understood, I can create the partitions like this:
s3://my-bucket/my-dataset/dt=2017-07-01/
...
s3://my-bucket/my-dataset/dt=2017-07-09/
s3://my-bucket/my-dataset/dt=2017-07-10/
or like this,
s3://mybucket/year=2017/month=06/day=01/
s3://mybucket/year=2017/month=06/day=02/
...
s3://mybucket/year=2017/month=08/day=31/
Which will be faster to query, given that I have 7 years of data?
Also, how can I add partitioning on product id here, so that queries will be faster?
How can I create these key=value folder structures (s3://mybucket/year=2017/month=06/day=01/) using Spark Scala? Any examples?
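In Spark Scala, the key=value layout comes for free from df.write.partitionBy("year", "month", "day").parquet("s3://mybucket/"), given those columns in the DataFrame; the directory names are derived from the column values following the Hive convention. A plain-Python sketch of the target layout itself (bucket and columns taken from the question):

```python
from datetime import date

def partition_path(bucket: str, d: date) -> str:
    """Build the Hive-style key=value prefix for one day's partition."""
    return f"s3://{bucket}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

path = partition_path("mybucket", date(2017, 6, 1))
# "s3://mybucket/year=2017/month=06/day=01/"
```

Note the zero-padding here mirrors the question's example; Spark itself writes the raw column values, so integer month columns come out as month=6 unless formatted as strings first.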
We partitioned like this,
s3://bucket/year/month/day/hour/minute/product/region/availabilityzone/
s3://bucketname/2018/03/01/11/30/nest/e1/e1a
Minute is rounded to 30 minutes. If your traffic is high, you can go for a higher resolution on minutes, or you can coarsen to hourly or even daily partitions.
This scheme helped a lot, depending on what data we want to query (using Athena or Redshift Spectrum) and over what time range.
Hope it helps.