JSC is empty when creating spark DataFrame - pyspark

I'm trying to learn spark so don't judge harshly. I have the following problem. I can run spark basic examples like this one
import os
os.environ['PYSPARK_PYTHON'] = '/g/scb/patil/andrejev/python36/bin/python3'
import random
from pyspark import SparkConf, SparkContext
from pyspark.sql.types import *
from pyspark.sql import *
sc.stop()
conf = SparkConf().setAppName('').setMaster('spark://remotehost:7789').setSparkHome('/path/to/spark-2.3.0-bin-hadoop2.7/')
sc = SparkContext(conf=conf)
num_samples = 100
def inside(p):
x, y = random.random(), random.random()
return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
but when I am creating data frame I have error that _jsc is NULL
eDF = sqlContext.createDataFrame([Row(a=1, intlist=[1,2,3], mapfield={"a": "b"})])
/usr/local/spark/python/pyspark/traceback_utils.py in __enter__(self)
70 def __enter__(self):
71 if SCCallSiteSync._spark_stack_depth == 0:
---> 72 self._context._jsc.setCallSite(self._call_site)
73 SCCa._spark_stack_depth += 1
Here are the environemnt variables that are set on local machine
SPARK_HOME': '/usr/local/spark/
PYSPARK_DRIVER_PYTHON: '/usr/bin/python3'
PYSPARK_DRIVER_PYTHON_OPTS: 'notebook'
PYSPARK_PYTHON: '/g/scb/patil/andrejev/python36/bin/python3'
PATH': '...:/usr/lib/jvm/java-8-oracle/jre/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:/usr/local/spark/bin'
and on remote machine
PYSPARK_PYTHON=/g/scb/patil/andrejev/python36/bin/python3
PYSPARK_DIRVER_PYTHON=/g/scb/patil/andrejev/python36/bin/python3

In the end I figured out that I had two sessions (one default and one that I created) running in the same time. I ended explicitly using my session to create DataFrame.
sess = SparkSession(sc)
freq_signal = sess.createDataFrame([Row(a=1, intlist=[1,2,3], mapfield={"a": "b"})])

Related

Pyspark / Databricks. Kolmogorov - Smirnov over time. Efficiently. In parallel

Hello StackOverflowers.
I have a pyspark dataframe that consists of a time_column and a column with values.
E.g.
+----------+--------------------+
| snapshot| values|
+----------+--------------------+
|2005-01-31| 0.19120256617637743|
|2005-01-31| 0.7972692479278891|
|2005-02-28|0.005236883665445502|
|2005-02-28| 0.5474099672222935|
|2005-02-28| 0.13077227571485905|
+----------+--------------------+
I would like to perform a KS test of each snapshot value with the previous one.
I tried to do it with a for loop.
import numpy as np
from scipy.stats import ks_2samp
import pyspark.sql.functions as F
def KS_for_one_snapshot(temp_df, snapshots_list, j, var = "values"):
sample1=temp_df.filter(F.col("snapshot")==snapshots_list[j])
sample2=temp_df.filter(F.col("snapshot")==snapshots_list[j-1]) # pick the last snapshot as the one to compare with
if (sample1.count() == 0 or sample2.count() == 0 ):
ks_value = -1 # previously "0 observations" which gave type error
else:
ks_value, p_value = ks_2samp( np.array(sample1.select(var).collect()).reshape(-1)
, np.array(sample2.select(var).collect()).reshape(-1)
, alternative="two-sided"
, mode="auto")
return ks_value
results = []
snapshots_list = df.select('snapshot').dropDuplicates().sort('snapshot').rdd.flatMap(lambda x: x).collect()
for j in range(len(snapshots_list) - 1 ):
results.append(KS_for_one_snapshot(df, snapshots_list, j+1))
results
But the data in reality is huge so it takes forever. I am using databricks and pyspark, so I wonder what would be a more efficient way to run it by avoiding the for loop and utilizing the available workers.
I tried to do it by using a udf but in vain.
Any ideas?
PS. you can generate the data with the following code.
from random import randint
df = (spark.createDataFrame( range(1,1000), T.IntegerType())
.withColumn('snapshot' ,F.array(F.lit("2005-01-31"), F.lit("2005-02-28"),F.lit("2005-03-30") ).getItem((F.rand()*3).cast("int")))
.withColumn('values', F.rand()).drop('value')
)
Update:
I tried the following by using an UDF.
var_used = 'values'
data_input_1 = df.groupBy('snapshot').agg(F.collect_list(var_used).alias('value_list'))
data_input_2 = df.groupBy('snapshot').agg(F.collect_list(var_used).alias("value_list_2"))
windowSpec = Window.orderBy("snapshot")
data_input_2 = data_input_2.withColumn('snapshot_2', F.lag("snapshot", 1).over(Window.orderBy('snapshot'))).filter('snapshot_2 is not NULL')
data_input_final = data_input_final = data_input_1.join(data_input_2, data_input_1.snapshot == data_input_2.snapshot_2)
def KS_one_snapshot_general(sample_in_list_1, sample_in_list_2):
if (len(sample_in_list_1) == 0 or len(sample_in_list_2) == 0 ):
ks_value = -1 # previously "0 observations" which gave type error
else:
print('something')
ks_value, p_value = ks_2samp( sample_in_list_1
, sample_in_list_2
, alternative="two-sided"
, mode="auto")
return ks_value
import pyspark.sql.types as T
KS_one_snapshot_general_udf = udf(KS_one_snapshot_general, T.FloatType())
data_input_final.select( KS_one_snapshot_general_udf('value_list', 'value_list_2')).display()
Which works fine if the dataset (per snapshot) is small. But If I increase the number of rows then I end up with an error.
PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)

Python Lightning for sine wave simulation

I am getting an error when i run the below code. The error says
MisconfigurationException: No training_step() method defined. Lightning Trainer expects as minimum a training_step(), train_dataloader() and configure_optimizers() to be defined.
Can someone please let me know what is the issue here? I am very new to Pytorch. I am trying to simulate Sin wave using MLP
import numpy as np ## using again numpy library for Sin function
import torch ## using pytorch
import matplotlib.pyplot as plt
import pytorch_lightning as pl
import torch.optim as optim
from torch import nn
from pytorch_lightning import Trainer
from sklearn.model_selection import train_test_split
import pandas as pd
from torch.utils.data import DataLoader
N=1000 ## 1000 samples to be generated
L=1000 ## length of each sample
T=20 ## width of wave
x = np.random.randn(1000)
y = np.sin(x)
df = pd.DataFrame({'x':x, 'y':y})
train, test = train_test_split(df, test_size=0.2, random_state=42, shuffle=True)
target_fields=['y']
train_features, train_targets = train.drop(target_fields, axis=1), train[target_fields]
test_features, test_targets = test.drop(target_fields, axis=1), test[target_fields]
class MLP(pl.LightningModule):
def __init__(self):
super(MLP,self).__init__()
self.fc1 = nn.Linear(1, 10)
self.fc2 = nn.Linear(10, 1)
def forward(self, x):
x = torch.Relu(self.fc1(x))
x = self.fc2(x)
return x
l_rate = 0.2
mse_loss = nn.MSELoss(reduction = 'mean')
def train_dataloader(self):
train_dataset = TensorDataset(torch.tensor(train_features.values).float(), torch.tensor(train_targets[['cnt']].values).float())
train_loader = DataLoader(dataset = train_dataset, batch_size = 128)
return train_loader
def test_dataloader(self):
test_dataset = TensorDataset(torch.tensor(test_features.values).float(), torch.tensor(test_targets[['cnt']].values).float())
test_loader = DataLoader(dataset = test_dataset, batch_size = 128)
return test_loader
def configure_optimizers(self):
return optim.SGD(self.parameters(), lr=l_rate)
def training_step(self, batch, batch_idx):
x, y = batch
logits = self.forward(x)
loss = mse_loss(logits, y)
# Add logging
logs = {'loss': loss}
return {'loss': loss, 'log': logs}
def test_step(self, batch, batch_idx):
x, y = batch
logits = self.forward(x)
loss = mse_loss(logits, y)
correct = torch.sum(logits == y.data)
predictions_pred.append(logits)
predictions_actual.append(y.data)
return {'test_loss': loss, 'test_correct': correct, 'logits': logits}
def test_epoch_end(self, outputs):
avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
logs = {'test_loss': avg_loss}
return {'avg_test_loss': avg_loss, 'log': logs, 'progress_bar': logs }
model = MLP()
trainer = Trainer(max_epochs = 50)
trainer.fit(model)
Error
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
---------------------------------------------------------------------------
MisconfigurationException Traceback (most recent call last)
<ipython-input-9-7bdf5ac9771f> in <module>()
1 model = MLP()
2 trainer = Trainer(max_epochs = 50)
----> 3 trainer.fit(model)
3 frames
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/configuration_validator.py in __verify_train_loop_configuration(self, model)
50 if not has_training_step:
51 raise MisconfigurationException(
---> 52 "No `training_step()` method defined. Lightning `Trainer` expects as minimum a"
53 " `training_step()`, `train_dataloader()` and `configure_optimizers()` to be defined."
54 )
MisconfigurationException: No `training_step()` method defined. Lightning `Trainer` expects as minimum a `training_step()`, `train_dataloader()` and `configure_optimizers()` to be defined.
You are missing 2 parameters in your trainer.fit() call. See the documentation

Optimizing Spark/Scala speed

I have a Spark script that establishes a connection to Hive and read Data from different databases and then writes the union into a CSV file. I tested it with two databases and it took 20 minutes. Now I am trying it with 11 databases and it has been running since yesterday evening (18 hours!). The script is supposed to get between 400000 and 800000 row per database.
My question is: is 18 hours normal for such jobs? If not, how can I optimize it? This is what my main does:
// This is a list of the ten first databases used:
var use_database_sigma = List( Parametre_vigiliste.sourceDbSigmaGca, Parametre_vigiliste.sourceDbSigmaGcm
,Parametre_vigiliste.sourceDbSigmaGge, Parametre_vigiliste.sourceDbSigmaGne
,Parametre_vigiliste.sourceDbSigmaGoc, Parametre_vigiliste.sourceDbSigmaGoi
,Parametre_vigiliste.sourceDbSigmaGra, Parametre_vigiliste.sourceDbSigmaGsu
,Parametre_vigiliste.sourceDbSigmaPvl, Parametre_vigiliste.sourceDbSigmaLbr)
val grc = Tables.getGRC(spark) // This creates the first dataframe
var sigma = Tables.getSIGMA(spark, use_database_sigma(0)) // This creates other dataframe which is the union of ten dataframes (one database each)
for(i <- 1 until use_database_sigma.length)
{
if (use_database_sigma(i) != "")
{
sigma = sigma.union(Tables.getSIGMA(spark, use_database_sigma(i)))
}
}
// writing into csv file
val grc_sigma=sigma.union(grc) // union of the 2 dataframes
grc_sigma.cache
LogDev.ecrireligne("total : " + grc_sigma.count())
grc_sigma.repartition(1).write.mode(SaveMode.Overwrite).format("csv").option("header", true).option("delimiter", "|").save(Parametre_vigiliste.cible)
val conf = new Configuration()
val fs = FileSystem.get(conf)
val file = fs.globStatus(new Path(Parametre_vigiliste.cible + "/part*"))(0).getPath().getName();
fs.rename(new Path(Parametre_vigiliste.cible + "/" + file), new Path(Parametre_vigiliste.cible + "/" + "FIC_PER_DATALAKE_.csv"));
grc_sigma.unpersist()
Not written in an IDE so it might be off somewhere, but you get the general idea.
val frames = Seq("table1", "table2).map{ table =>
spark.read.table(table).cache()
}
frames
.reduce(_.union(_)) //or unionByName() if the columns aren't in the same order
.repartition(1)
.write
.mode(SaveMode.Overwrite)
.format("csv")
.options(Map("header" -> "true", "delimiter" -> "|"))
.save("filePathName")

User Defined Aggregate Function in Spark to implement percentile

I am trying to write udaf to calculate the percentile values.
I need to write the custom function because existing spark function percentile_approx, approx_percentile and percentile uses rounding differently than my need.
I need to use floor instead of midpoint rounding. Is there anyway I can write it in pyspark?
If not how to achieve this in scala?
I need to calculate the percentile using below method:
def percentile_custom(lst, per):
lst.sorted()
rank = (len(lst)+1)*per
ir = math.floor(rank)
ir1 = math.ceil(rank)
if (ir == ir1):
return lst[ir-1]
else:
fr = rank - ir
ir_qh = lst[ir-1]
ir_qh1 = lst[ir]
inter = ((ir_qh1 - ir_qh)*fr) + ir_qh
return math.floor(inter)
Below is the function for the same I have written in pyspark, let me know in case it didn't work out for you :
from pyspark.sql import Window
import math
import pyspark.sql.types as T
import pyspark.sql.functions as F
def calc_percentile(perc_df, part_col, order_col, p_val=[33,66], num_bins=100, max_bins = 100, perc_col="p_band"):
"""
Calculate percentile with nimber of bins on specified columns
"""
win = Window.partitionBy(*part_col).orderBy(order_col)
def perc_func(col, num, max_bins):
step = max_bins / num
return {(p_tile / step): int(
math.ceil(col * (p_tile / float(max_bins)))
) for p_tile in range(step, max_bins + step, step)}
perc_udf = F.udf(perc_func, T.MapType(T.IntegerType(), T.IntegerType()))
# perc_df.show()
rank_data = perc_df.filter(
F.col(order_col).isNotNull()
).withColumn(
"rank", F.dense_rank().over(win)
)
rank_data.persist()
rank_data.count()
overall_count_data = rank_data.groupBy(
*part_col
).agg(
F.max(
F.col("rank")
).alias("count")
).select(
F.explode(
perc_udf(F.col("count"), F.lit(num_bins), F.lit(max_bins))
).alias("n_tile", "rank"), "count",
*part_col
)
overall_count_data.persist()
overall_count_data.count()
return overall_count_data.join(
rank_data, part_col + ["rank"]
).withColumn(
perc_col,
F.concat(F.lit("P_"), F.col("n_tile").cast("string"))
).groupBy(
*part_col
).pivot(
perc_col, ["P_{0}".format(p_val1) for p_val1 in p_val]
).agg(
F.max(order_col)
).select(
*(
part_col + [F.col("P_{0}".format(p_val1)) for p_val1 in p_val]
)
)

Duplicate values in read from file minibatches TensorFlow

I followed the tutorial about Reading data with TF and made some tries myself. Now, the problem is that my tests show duplicate data in the batches I created when reading data from a CSV.
My code looks like this:
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import collections
import numpy as np
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
class XICSDataSet:
def __init__(self, height=20, width=195, batch_size=1000, noutput=15):
self.depth = 1
self.height = height
self.width = width
self.batch_size = batch_size
self.noutput = noutput
def trainingset_files_reader(self, data_dir, nfiles):
fnames = [os.path.join(data_dir, "test%d"%i) for i in range(nfiles)]
filename_queue = tf.train.string_input_producer(fnames, shuffle=False)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
record_defaults = [[.0],[.0],[.0],[.0],[.0]]
data_tuple = tf.decode_csv(value, record_defaults=record_defaults, field_delim = ' ')
features = tf.pack(data_tuple[:-self.noutput])
label = tf.pack(data_tuple[-self.noutput:])
depth_major = tf.reshape(features, [self.height, self.width, self.depth])
min_after_dequeue = 100
capacity = min_after_dequeue + 30 * self.batch_size
example_batch, label_batch = tf.train.shuffle_batch([depth_major, label], batch_size=self.batch_size, capacity=capacity,
min_after_dequeue=min_after_dequeue)
return example_batch, label_batch
with tf.Graph().as_default():
ds = XICSDataSet(2, 2, 3, 1)
im, lb = ds.trainingset_files_reader(filename, 1)
sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)
tf.train.start_queue_runners(sess=sess)
for i in range(1000):
lbs = sess.run([im, lb])[1]
_, nu = np.unique(lbs, return_counts=True)
if np.array_equal(nu, np.array([1, 1, 1])) == False:
print('Not unique elements found in a batch!')
print(lbs)
I tried with different batch sizes, different number of files, different values of capacity and min_after_dequeue, but I always get the problem. In the end, I would like to be able to read data from only one file, creating batches and shuffling the examples.
My files, created ad hoc for this test, have 5 lines each representing samples, and 5 columns. The last column is meant to be the label for that sample. These are just random numbers. I'm using only 10 files just to test this out.
The default behavior for tf.train.string_input_producer(fnames) is to produce an infinite number of copies of the elements in fnames. Therefore, since your tf.train.shuffle_batch() capacity is larger than the total number of elements in your input files (5 elements per file * 10 files = 50 elements), and the min_after_dequeue is also larger than the number of elements, the queue will contain at least two full copies of the input data before the first batch is produced. As a result, it is likely that some batches will contain duplicate data.
If you only want to process each example once, you can set an explicit num_epochs=1 when creating the tf.train.string_input_producer(). For example:
def trainingset_files_reader(self, data_dir, nfiles):
fnames = [os.path.join(data_dir, "test%d" % i) for i in range(nfiles)]
filename_queue = tf.train.string_input_producer(
fnames, shuffle=False, num_epochs=1)
# ...