How to do a fuzzy join and a difference join at the same time - left-join

I'm trying to join two datasets where the join fields are (1) a numeric column on which I want to allow a difference threshold of 0.05, and (2) exact character matches on two other fields. See below for a simplified example of the two datasets and the desired output:
# df1
site genus distance
HA Melaleuca 0.1
HA Melaleuca 0.1
HA Eucalyptus 0.3
HA Melaleuca 0.6
HA Eucalyptus 1.3
HA Eucalyptus 1.55
HA Eucalyptus 1.55
HA Melaleuca 1.75
HA Melaleuca 1.8
HA Melaleuca 1.9
# df2
site genus distance 1998 2008
HA Eucalyptus 0.1 na 4
HA Melaleuca 0.1 4 4
HA Eucalyptus 0.3 4 d
HA Melaleuca 0.65 4 3
HA Melaleuca 1.8 na 4
HA Eucalyptus 1.6 5 4
HA Eucalyptus 2.1 2 3
HA Melaleuca 2.5 4 4
HA Melaleuca 2.6 4 3
HA Eucalyptus 2.7 2 n/a
# desired join output
site genus distance 1998 2008
HA Melaleuca 0.1 na 4
HA Melaleuca 0.1 4 4
HA Eucalyptus 0.3 4 d
HA Melaleuca 0.6 4 3
HA Eucalyptus 1.3 na na
HA Eucalyptus 1.55 5 4
HA Eucalyptus 1.55 5 4
HA Melaleuca 1.75 na na
HA Melaleuca 1.8 na na
HA Melaleuca 1.9 na na
HA Eucalyptus 2.1 2 3
HA Melaleuca 2.5 4 4
HA Melaleuca 2.6 4 3
HA Eucalyptus 2.7 2 n/a
The function difference_full_join() [fuzzyjoin package] lets you specify a distance threshold (in the case above, it matches any rows whose "distance" values are within 0.05), but within it I can't also require an exact character match to make sure that the site and genus columns are the same. That second part is easy enough to do with the fuzzy_full_join() function by specifying == as the function to match on, but I don't know how to write a function for the match_fun = argument so that it performs the difference comparison. All the examples of this function I can find online use simpler operators such as <=.
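To make that concrete, something along these lines is what I'm after, assuming match_fun = can take a list with one function per column in by; the anonymous comparator at the end is the part I'm unsure about. (The data frames below are a hand-typed subset of the example data, with the 1998/2008 columns renamed to yr1998/yr2008 so the names stay syntactic.)

library(fuzzyjoin)

# first few rows of the example data, typed in by hand
df1 <- data.frame(site = "HA",
                  genus = c("Melaleuca", "Eucalyptus", "Melaleuca", "Eucalyptus"),
                  distance = c(0.1, 0.3, 0.6, 1.55))
df2 <- data.frame(site = "HA",
                  genus = c("Eucalyptus", "Melaleuca", "Eucalyptus", "Melaleuca", "Eucalyptus"),
                  distance = c(0.1, 0.1, 0.3, 0.65, 1.6),
                  yr1998 = c(NA, "4", "4", "4", "5"),
                  yr2008 = c("4", "4", "d", "3", "4"))

# one match function per "by" column: exact matches on site and genus,
# absolute difference of at most 0.05 on distance
fuzzy_full_join(df1, df2,
                by = c("site", "genus", "distance"),
                match_fun = list(`==`,
                                 `==`,
                                 function(x, y) abs(x - y) <= 0.05))

If this is the right approach, I'd expect the output to carry duplicated site.x/site.y, genus.x/genus.y and distance.x/distance.y columns that would need tidying afterwards.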

I wanted to post the workaround I found as an answer.
Because the character matches were identical rather than fuzzy, I added a numeric factor column for each of them to the two original datasets (so Eucalyptus = 1, Melaleuca = 2, and HA = 1 in the example above). Then I did a difference_full_join with a difference threshold of 0.5; because the factor codes only ever match when they are identical, the difference join effectively applied only to the distance column.
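A minimal sketch of that workaround, reusing the hand-typed df1/df2 from the question above. The add_codes() helper and the *_code column names are just illustrative, and max_dist is set here to the 0.05 tolerance from the question (any threshold below 1 still keeps the integer codes as exact-only matches):

library(fuzzyjoin)

# add numeric codes for the character keys (HA = 1; Eucalyptus = 1, Melaleuca = 2)
add_codes <- function(df) {
  df$site_code  <- as.numeric(factor(df$site,  levels = "HA"))
  df$genus_code <- as.numeric(factor(df$genus, levels = c("Eucalyptus", "Melaleuca")))
  df
}

# the codes differ by 0 (same site/genus) or by at least 1 (different), so with
# max_dist < 1 they behave like exact matches and the tolerance effectively
# constrains only the distance column
difference_full_join(add_codes(df1), add_codes(df2),
                     by = c("site_code", "genus_code", "distance"),
                     max_dist = 0.05)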

Related

Lag in reading messages from a Kafka topic by a Storm spout

While ingesting messages on a Kafka topic, the Storm spout is not picking them up immediately; there is a lag of more than an hour.
There is one spout and three bolts in the topology.
Spout: ddl
Bolts: kafkabolt, deletebolt, deletemapperbolt
Storm Config:
ddl.spout.executors: 3
topology.spout.executors: 10
topology.acker.executors: 3
topology.bolt.executors.kafkabolt: 2
topology.bolt.executors.deletebolt: 3
topology.bolt.tasks.deletebolt: 3
topology.max.spout.pending: 1
topology.bolt.executors.deletemapperbolt: 3
topology.bolt.tasks.deletemapperbolt: 3
topology.message.timeout.secs: 300
topology.max.task.parallelism: 100
topology.workers: 1
topology.debug: false
topology.executor.receive.buffer.size: 65536
topology.executor.send.buffer.size: 65536
topology.receiver.buffer.size: 64
topology.transfer.buffer.size: 64

postgresql streaming replication slow on macOS

I am using PostgreSQL 10.1 on macOS, on which I am trying to set up streaming replication. I configured both master and slave to be on the same machine. I find the streaming replication lag to be higher than expected on macOS; the same test runs on a Linux Ubuntu 16.04 machine without much lag.
I have the following insert script.
for i in $(seq 1 1 1000)
do
    bin/psql postgres -p 8999 -c "Insert into $1 select tz, $i * 127361::bigint, $i::real, random()*12696::bigint from generate_series('01-01-2018'::timestamptz, '02-01-2018'::timestamptz, '30 sec'::interval)tz;"
    echo $i
done
The lag is measured using the following queries,
SELECT pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn();
SELECT (extract(epoch FROM now()) - extract(epoch FROM pg_last_xact_replay_timestamp()))::int;
However, the observations are very unexpected: the lag keeps increasing from the moment the transactions are started on the master.
Slave localhost_9001: 12680304 1
Slave localhost_9001: 12354168 1
Slave localhost_9001: 16086800 1
.
.
.
Slave localhost_9001: 3697460920 121
Slave localhost_9001: 3689335376 122
Slave localhost_9001: 3685571296 122
.
.
.
.
Slave localhost_9001: 312752632 190
Slave localhost_9001: 308177496 190
Slave localhost_9001: 303548984 190
.
.
Slave localhost_9001: 22810280 199
Slave localhost_9001: 8255144 199
Slave localhost_9001: 4214440 199
Slave localhost_9001: 0 0
It took around 4.5 minutes for a single client inserting into a single table to complete on the master, and another 4 minutes for the slave to catch up. Note that NO simultaneous selects are run other than the script that measures the lag.
I understand that replay in PostgreSQL is pretty simple, something like "move a particular block to a location", but I am not sure about this behavior.
I have the following other configuration settings:
checkpoint_timeout = 5min
max_wal_size = 1GB
min_wal_size = 80MB
When I run the same tests with the same configuration on a Linux Ubuntu 16.04 machine, I find the lag perfectly reasonable.
Am I missing anything?

metastore_db doesn't get created with Apache Spark 2.2.1 on Windows 7

I want to read CSV files using the latest Apache Spark version, i.e. 2.2.1, on Windows 7 via cmd, but I am unable to do so because there is some problem with the metastore_db. I tried the steps below:
1. spark-shell --packages com.databricks:spark-csv_2.11:1.5.0  // since my Scala version is 2.11
2. val df = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load("file:///D:/ResourceData.csv")  // in the latest versions we use the SparkSession variable spark instead of sqlContext
but it throws the error below:
Caused by: org.apache.derby.iapi.error.StandardException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader
Caused by: org.apache.derby.iapi.error.StandardException: Another instance of Derby may have already booted the database
I am able to read CSV files in version 1.6, but I want to do it in the latest version. Can anyone help me with this? I have been stuck for many days.
Open Spark Shell
spark-shell
Pass the Spark context to SQLContext and assign it to the sqlContext variable
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // the Spark context is available as 'sc'
Read the CSV file as per your requirement
val bhaskar = sqlContext.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/home/burdwan/Desktop/bhaskar.csv") // Use wildcard, with * we will be able to import multiple csv files in a single load ...Desktop/*.csv
Collect the RDDs and Print
bhaskar.collect.foreach(println)
Output
_a1 _a2 Cn clr clarity depth aprx price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Good J VVS2 63 57 336 3.94 3.96 2.48
Finally, even this worked only on a Linux-based OS. Download Apache Spark from the official documentation and set it up using this link. Just verify that you are able to invoke spark-shell. Now enjoy loading and performing actions on any type of file with the latest Spark version. I don't know why it isn't working on Windows, even though I am running it for the first time.

Why is Spark breaking my stage into 3 different stages with the same description and DAG?

I have a 5 worker node cluster with 1 executor each and 4 cores per executor.
I have an RDD spread over 20 partitions that I check with the rdd.isEmpty method. In the Spark history server, I can see three different "jobs" with the same "description":
JobId Description Tasks
3 isEmpty at myFile.scala:42 1/1
4 isEmpty at myFile.scala:42 4/4
5 isEmpty at myFile.scala:42 15/15
When I click into these jobs/stages, they all have the same DAG. What might be causing the isEmpty stage to get broken into 3 different stages?
Additionally, when I change the RDD from 20 to 8 partitions, the History server shows:
JobId Description Tasks
3 isEmpty at myFile.scala:42 1/1
4 isEmpty at myFile.scala:42 4/4
5 isEmpty at myFile.scala:42 3/3
In both cases the sum of tasks from these 3 stages equals the total number of partitions of the RDD. Why doesn't it just put it all in one stage, like:
JobId Description Tasks
3 isEmpty at myFile.scala:42 20/20

Tuning kafka performance to get 1 Million messages/second

I'm using 3 VM servers, each with 16 cores / 56 GB RAM / 1 TB of disk, to set up a Kafka cluster. I work with Kafka version 0.10.0. I installed a broker on two of them. I created a topic with 2 partitions, 1 partition per broker, and no replication.
My goal is to reach 1,000,000 messages/second.
I ran a test with the kafka-producer-perf-test.sh script and I get between 150,000 msg/s and 204,000 msg/s.
My configuration is:
- batch size: 8k (8192)
- message size: 300 bytes (0.3 KB)
- thread count: 1
The producer configuration:
-request.required.acks=1
-queue.buffering.max.ms=0 #linger.ms=0
-compression.codec=none
-queue.buffering.max.messages=100000
-send.buffer.bytes=100000000
Any help to reach 1,000,000 msg/s would be appreciated.
Thank you
You're running an old version of Apache Kafka. The most recent release (0.11) includes a number of improvements, including to performance.
You might find this useful too: https://www.confluent.io/blog/optimizing-apache-kafka-deployment/