Java code to check if a pcollection is empty - apache-beam

I am trying to write a pipeline to insert/update/delete a MySQL table based on Pub/Sub messages. While inserting into a particular table, I have to check whether the data exists in another table and do the insert only when it does.
I have to stop the insertion process when there is no data in the other table (PCollection).
PCollection recordCount = windowedMatchedCollection.apply(Combine.globally(new CountElements()).withoutDefaults());
This line does not seem to help. Any inputs on this, please?

It's a little unclear exactly what you're trying to do, but this should be achievable with counting elements. For example, suppose you have
# A PCollection of (table, new_row) KVs.
new_data = ...
# A PCollection of (table, old_row) KVs.
old_data = ...
You could then do
rows_per_old_table = old_data | beam.CombinePerKey(
    beam.combiners.CountCombineFn())
and use this to filter out your data with a side input.
def maybe_filter_row(table_and_row, old_table_count):
    # table_and_row comes from the PCollection new_data
    # old_table_count is the side input as a Map
    table = table_and_row[0]
    # Default to 0 so tables absent from old_data are filtered out.
    if old_table_count.get(table, 0) > 0:
        yield table_and_row

new_data_to_update = new_data | beam.FlatMap(
    maybe_filter_row,
    old_table_count=beam.pvalue.AsMap(rows_per_old_table))
Now your new_data_to_update will contain only that data for tables that had a non-zero number of rows in old_data.
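For reference, here is a minimal, self-contained batch sketch of the approach above (the pipeline, table names, and sample rows are illustrative assumptions, not from the original question):
import apache_beam as beam

def maybe_filter_row(table_and_row, old_table_count):
    table = table_and_row[0]
    # Keep the row only if the old table has at least one row in the side input.
    if old_table_count.get(table, 0) > 0:
        yield table_and_row

with beam.Pipeline() as p:
    new_data = p | 'NewData' >> beam.Create(
        [('orders', {'id': 1}), ('customers', {'id': 7})])
    old_data = p | 'OldData' >> beam.Create(
        [('orders', {'id': 0})])  # 'customers' has no existing rows

    rows_per_old_table = old_data | beam.CombinePerKey(
        beam.combiners.CountCombineFn())

    new_data_to_update = new_data | beam.FlatMap(
        maybe_filter_row,
        old_table_count=beam.pvalue.AsMap(rows_per_old_table))
    # new_data_to_update now contains only ('orders', {'id': 1}).
Running this with the DirectRunner keeps only the rows whose table already has data in old_data.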
If you're trying to do this in a streaming fashion, everything would have to be windowed, including old_data, and it would filter out only those things that have data in that same window. You could instead do something like
# Create a PCollection containing the set of tables in new_data, per window.
tables_to_consider = (
    new_data
    | beam.GroupByKey()
    | beam.MapTuple(lambda table, rows: table))
rows_per_old_table = tables_to_consider | beam.ParDo(
    SomeDoFnLookingUpCurrentSizeOfEachTable())
# Continue as before.
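A hedged sketch of what such a DoFn might look like follows; SomeDoFnLookingUpCurrentSizeOfEachTable is just the placeholder named above, and get_row_count is a hypothetical helper that queries MySQL for the current number of rows in a table:
class SomeDoFnLookingUpCurrentSizeOfEachTable(beam.DoFn):
    def process(self, table):
        # get_row_count is a hypothetical lookup against the MySQL database.
        yield table, get_row_count(table)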

Related

Calculate hash over a whole column from pyspark dataframe

I have a big data frame (approximately 40 million rows) which looks like this:
|col A | col B |
|------|-------|
|valA1 | valB1 |
|valA2 | valB2 |
I want to compare two columns in different data frames that live in different workspaces. I am not allowed to bring both of them into the same environment. What I want is to create a hash value for every column, so I can compare it with the corresponding column from the other data frame.
The easy approach would be to concatenate all values from a column and then hash the resulting string. But because of the size of the data frame, I cannot do this.
So far I tried this version, but it takes too long:
hashlib.sha256(''.join(map(str,df.agg(collect_list(col("colName"))).first()[0])).encode('utf-8')).hexdigest()
and also this, which takes equally long:
def compute_hash(df):
    hasher = hashlib.sha256()
    dataCollect = df.rdd.toLocalIterator()
    for row in dataCollect:
        hasher.update(row['colName'].encode('utf-8'))
    return hasher.hexdigest()
Is this achievable in Spark in a reasonable time?
You don't need to hash the whole string at once.
Here is an example using sha256 from the hashlib library:
import hashlib

column = ['valA1', 'valA2', 'valA3']
hasher = hashlib.sha256()
for row in column:
    hasher.update(row.encode('utf-8'))
print(hasher.hexdigest())
# >>> 68f900960718b4881107929da0918e0e9f50599b12ebed3ec70066e55c3ec5f4
Using the update method will process the data as you use it.
The solution was to group by a column whose data is evenly distributed. This way, Spark triggers parallel execution for every value of "columnToGroupBy" and generates a dataframe whose first column holds the values of "columnToGroupBy" and whose second column holds a hash over the concatenated values of "colToHash" corresponding to that value of "columnToGroupBy". For example, if we had this table:
| columnToGroupBy | colToHash |
|-----------------|-----------|
| ValueA1         | ValB1     |
| ValueA2         | ValB2     |
| ValueA1         | ValB3     |
| ValueA2         | ValB4     |
After applying this function:
df.groupby("columnToGroupBy").agg(md5(concat_ws(",", array_sort(collect_set(col("colToHash"))))))
We would get the following dataframe:
| columnToGroupBy | hash              |
|-----------------|-------------------|
| ValueA1         | md5(ValB1,ValB3)  |
| ValueA2         | md5(ValB2,ValB4)  |
Since this new dataframe has a small number of rows, equal to the number of distinct values in "columnToGroupBy", you can easily generate a hash for the whole column by collecting all the values from the "hash" column, concatenating them, and hashing the result.
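A hedged sketch of that final step, assuming the dataframe df and the column names columnToGroupBy and colToHash used above:
import hashlib
from pyspark.sql.functions import array_sort, col, collect_set, concat_ws, md5

# Per-group hashes, computed in parallel by Spark.
per_group = df.groupBy("columnToGroupBy").agg(
    md5(concat_ws(",", array_sort(collect_set(col("colToHash"))))).alias("hash"))

# The per-group result is small, so it is safe to collect it; sort the hashes
# for a deterministic order, then hash the concatenation once more.
group_hashes = sorted(row["hash"] for row in per_group.select("hash").collect())
column_hash = hashlib.sha256(",".join(group_hashes).encode("utf-8")).hexdigest()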

Creating a For loop that iterates through all the numbers in a column of a table in Matlab

I am a new user of MATLAB R2021b and I have a table where the last column (named loadings) spans multiple sub-columns (all sub-columns were added under the same variable/column and are treated as one column). I want to create a for loop that goes through each separate loadings column and iterates through them, prior to creating a tbl that I will input into a model. The sub-columns contain numbers, with rows corresponding to the number of participants.
Previously, I had a similar setup where the loop iterated over the names of different regions of interest, whereas now the loop has to iterate over columns that contain numbers: first the numbers in the first sub-column, then the second, and so on.
I am not sure whether I should split the last column with T1 = splitvars(T1, 'loadings') first or whether I am not indexing into the table correctly or performing the right transformations. I would appreciate any help.
roi.ic = T.loadings;
roinames = roi.ic(:,1);
roinames = [num2str(roinames)];
for iroi = 1:numel(roinames)
    f_roiname = roinames{iroi};
    tbl = T1;
    tbl.(roinames) = T1.loadings(:,roiname);
    tbl.(roinames) = T1.loadings_rsfa(:,roiname)
The last assignment fails with this error:
Unable to use a value of type cell as an index.
Error in tabular/dotParenReference (line 120)
b = b(rowIndices,colIndices)

Want to delete all rows of records containing null values in DolphinDB

I have a table where a record may contain null values in one or more columns. I want to delete any record that contains a null value. I'm wondering if there is a suggested way to do that in DolphinDB?
Try the DolphinDB function rowAnd to specify the output condition.
The following script is for your reference. It keeps a row only when every column is non-null, i.e., it drops the records that contain a NULL:
sym = take(`a`b`c, 110)
id = 1..100 join take(int(),10)
id2 = take(int(),10) join 1..100
t = table(sym, id, id2)
t[each(isValid, t.values()).rowAnd()]
The resulting table contains only the rows where both id and id2 are non-null (rows 11 through 100 of the sample data).

Extract and Replace values from duplicates rows in PySpark Data Frame

I have duplicate rows in a PySpark data frame that may contain the same data or have missing values.
The code that I wrote is very slow and does not work as a distributed system.
Does anyone know how to retain single unique values from duplicate rows in a PySpark dataframe, in a way that can run as a distributed system and with fast processing time?
I have written complete PySpark code and this code works correctly.
But the processing time is really slow and it's not possible to use it on a Spark cluster.
'''
# Columns of duplicate rows of the DF
dup_columns = df.columns

for row_value in df_duplicates.rdd.toLocalIterator():
    print(row_value)

    # Match duplicates using stdname/stdaddress and create an RDD
    fill_duplicated_rdd = ((df.where(sf.col("stdname") == row_value['stdname'])
                              .where(sf.col("stdaddress") == row_value['stdaddress']))
                           .rdd.map(fill_duplicates))

    # Create the feature names for the same RDD
    fill_duplicated_rdd_col_names = (((df.where((sf.col("stdname") == row_value['stdname']) &
                                                (sf.col("stdaddress") == row_value['stdaddress'])))
                                      .rdd.map(fill_duplicated_columns_extract)).first())

    # Create a DF from the previous RDD
    # This DF stores the values of a single set of matching duplicate rows
    df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)

    for column in df_streamline.columns:
        try:
            col_value = ([str(value[column]) for value in
                          df_streamline.select(col(column)).distinct().rdd.toLocalIterator()
                          if value[column] != ""])
            if len(col_value) >= 1:
                # The non-null, non-empty value of the column is stored here.
                # This is the single distinct value for the duplicate group.
                col_value = col_value[0]

                # The distinct value of the column is written back to replace
                # any rows in the PySpark DF that were empty.
                df_dedup = (df_dedup
                            .withColumn(column, sf.when((sf.col("stdname") == row_value['stdname'])
                                                        & (sf.col("stdaddress") == row_value['stdaddress']),
                                                        col_value)
                                        .otherwise(df_dedup[column])))
        except:
            print("None")
'''
There are no error messages, but the code runs very slowly. I want a solution that fills the empty rows in the PySpark DF with unique values; filling them with the mode of the values would also be fine.
"""
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
    try:
        # distinct() was replaced by isNotNull().limit(1).take(1) to improve
        # the speed of the code and extract a value of the row.
        col_value = df_streamline.select(column).where(
            sf.col(column).isNotNull()).limit(1).take(1)[0][column]
        df_dedup = (df_dedup
                    .withColumn(column, sf.when((sf.col("stdname") == row_value['stdname'])
                                                & (sf.col("stdaddress") == row_value['stdaddress']),
                                                col_value)
                                .otherwise(df_dedup[column])))
    except:
        print("None")
"""

Load multiple .csv-files into one table and create ID per .csv -postgres

Heyho. I am using PostgreSQL 9.5 and I am struggling with a problem.
I have multiple .csv files (40) and all of them have the same column count and column names. I would now like to import them into one table, but I want an ID per .csv file. Is it possible to automate this in Postgres (including adding a new ID column)? And how?
The approach might look like this:
test1.csv ==> table_agg ==> set ID = 1
test2.csv ==> table_agg ==> set ID = 2
.
.
.
test40.csv ==> table_agg ==> set ID = 40
I would be very glad if someone could help me.
Add a table that contains the filename and other info you would like to add to each dataset. Add a serial column, that you can use as a foreign key in your data table, i.e. a dataset identifier.
Create the data table. Add a foreign key field to refer to the dataset entry in the other table.
Use a Python script to parse and import the csv files into the database. First add the entry to the datasets table. Then determine the dataset ID and insert the rows into the data table with the corresponding dataset ID set.
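A hedged sketch of that approach using psycopg2 (the table layout, column names, and paths are assumptions for illustration):
import csv
import glob
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# One row per imported file, plus a data table referencing it via a foreign key.
cur.execute("""
    CREATE TABLE IF NOT EXISTS datasets (
        id serial PRIMARY KEY,
        filename text NOT NULL);
    CREATE TABLE IF NOT EXISTS data (
        dataset_id integer REFERENCES datasets(id),
        col_a text,
        col_b text);
""")

for path in glob.glob('PathToFolder/*.csv'):
    # Register the file and get its generated dataset ID.
    cur.execute("INSERT INTO datasets (filename) VALUES (%s) RETURNING id", (path,))
    dataset_id = cur.fetchone()[0]
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            cur.execute(
                "INSERT INTO data (dataset_id, col_a, col_b) VALUES (%s, %s, %s)",
                (dataset_id, *row))

conn.commit()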
My simple solution: assign an ID to each .csv file in Python and output all .csv files as one.
import glob, os, pandas as pd

path = r'PathToFolder'
# all .csv files in this folder
allFiles = glob.glob(path + "/*.csv")
# save DFs in list_
list_ = []
# DF for later concat
frame = pd.DataFrame()
# ID per DF/.csv
count = 0
for file_ in allFiles:
    # read .csv files
    df = pd.read_csv(file_, index_col=None, skiprows=[1], header=0)
    # new column with ID per DF
    df['new_id'] = count
    list_.append(df)
    count = count + 1
frame = pd.concat(list_)
frame.to_csv('PathToOuputCSV', index=False)
Continue with SQL:
CREATE TABLE statement..
COPY TABLE_NAME FROM 'PathToCSV' DELIMITER ',' CSV HEADER;