Spark: subtractByKey issue (pyspark)

I have an issue with subtractByKey.
I have two files.
The first one looks like this (client ID + client email):
client_id emails
4A85FD8E-197D-2AE3-B939-A527AFF16A04 imperdiet.non.vestibulum#mon***tur.com
D48D530C-CF68-DAF1-18F0-E0A0A03F3E06 rutrum.urna#estm***ncus.net:facilisis#i****m.ca
40815230-25DC-9EA0-01D1-2706B4B56958 iaculis.nec.eleifend#gr****nc.net
...
and the second one (email only):
pharetra#P****s.com
ut.aliquam#o****m.org
erat#a****e.edu
....
Some lines in the first file can have two (or more) emails in this format:
mail:mail
What I did:
test1 = sc.textFile("file1")
test2 = sc.textFile("file2")
test3 = test1.subtractByKey(test2)
and the result is:
[(u'A', u'B'), (u'A', u'D'), (u'A', u'1'), (u'A', u'D'), (u'A', u'D'), (u'A', u'B'), (u'A', u'F'), (u'A', u'E'), (u'A', u'9'), (u'A', u'5'), (u'A', u'9'), (u'A', u'6'), (u'c', u'l'), (u'E', u'8'), (u'E', u'4'), (u'E', u'6'), (u'E', u'6'), (u'E', u'7'), (u'E', u'5'), (u'E', u'5'), (u'E', u'5'), (u'E', u'2'), (u'E', u'8'), (u'C', u'2'), (u'C', u'5'), (u'C', u'6'), (u'C', u'C'), (u'C', u'E'), (u'C', u'3'), (u'C', u'F'), (u'C', u'4'), (u'C', u'B'), (u'C', u'F'), (u'C', u'F'), (u'C', u'8'), (u'C', u'0'), (u'1', u'D'), (u'1', u'2'), (u'1', u'3'), (u'1', u'8'), (u'1', u'0'), (u'1', u'F'), ... ]
I wanted to remove the clients in the first file whose emails appear in the second file, but it did not work.

Note: I am not very familiar with PySpark, but the Spark API should be the same.
First, you should make the email the key:
rdd1=sc.textFile("file1").map(lambda line: (line.split(" ")[0], line.split(" ")[1]))
This will give you an RDD of:
[(4A85FD8E-197D-2AE3-B939-A527AFF16A04,imperdiet.non.vestibulum#mon***tur.com)]
Then, since a line may contain multiple emails, you should do a flatMapValues():
rdd2 = rdd1.flatMapValues(lambda email: email.split(":"))
This gives you a pair RDD in which each record contains exactly one email.
Now you can swap the key and the value:
rdd3=rdd2.map(lambda kv: (kv[1], kv[0]))
Now you have an RDD that uses the user email as the key and the UUID as the value, such as:
[(imperdiet.non.vestibulum#mon***tur.com, 4A85FD8E-197D-2AE3-B939-A527AFF16A04)]
Next, you need to find which UUIDs have an email contained in file2. To do that, load the second file as an RDD:
secondRdd = sc.textFile("file2").map(lambda line: (line, 1))
then do a join and tweak the joined RDD:
rdd4 = rdd3.join(secondRdd).map(lambda kv: (kv[1][0], kv[0]))
If everything is right, you now have an RDD in the format (UUID, email) representing all the users whose email occurs in file2.
Finally, do a subtractByKey() against the original rdd1 to keep only the clients whose emails do not appear in file2.
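Putting the steps together, a minimal end-to-end sketch (assuming the space-separated layout of file1 and the ":"-separated multi-email format shown in the question; untested against real data):
# Sketch only: if file1 has a header line, filter it out first.
rdd1 = sc.textFile("file1").map(lambda line: (line.split(" ")[0], line.split(" ")[1]))  # (client_id, emails)
rdd2 = rdd1.flatMapValues(lambda email: email.split(":"))                               # one email per record
rdd3 = rdd2.map(lambda kv: (kv[1], kv[0]))                                              # (email, client_id)
secondRdd = sc.textFile("file2").map(lambda line: (line, 1))                            # (email, 1)
rdd4 = rdd3.join(secondRdd).map(lambda kv: (kv[1][0], kv[0]))                           # (client_id, email) present in file2
result = rdd1.subtractByKey(rdd4)                                                       # clients with no email in file2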

Related

psycopg2.errors.FeatureNotSupported: Specified types or functions (one per INFO message) not supported on Redshift tables

I am trying to run a test query against AWS Redshift from AWS Managed Airflow:
QUERY:
AWS_GET_DATA_FROM_REDSHIFT = """('SELECT * FROM information_schema.tables;')"""
stage_data_from_redshift_to_s3 = FromRedshiftToS3TransferOperator(
    task_id=f'Stage_unload_{SCHEMA}_{TABLE}_from_redshift_to_s3_{S3_BUCKET}',
    dag=dag,
    table=TABLE,
    s3_bucket=S3_BUCKET,
    s3_prefix=f'{SCHEMA}_{TABLE}',
    select_query=AWS_GET_DATA_FROM_REDSHIFT,
    unload_options=['CSV']
)
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class FromRedshiftToS3TransferOperator(BaseOperator):
    """
    Executes an UNLOAD command to s3 as a CSV with headers

    :param schema: reference to a specific schema in redshift database
    :type schema: str
    :param table: reference to a specific table in redshift database
    :type table: str
    :param s3_bucket: reference to a specific S3 bucket
    :type s3_bucket: str
    :param s3_key: reference to a specific S3 key
    :type s3_key: str
    :param redshift_conn_id: reference to a specific redshift database
    :type redshift_conn_id: str
    :param aws_conn_id: reference to a specific S3 connection
    :type aws_conn_id: str
    :param verify: Whether or not to verify SSL certificates for S3 connection.
        By default SSL certificates are verified.
        You can provide the following values:

        - ``False``: do not validate SSL certificates. SSL will still be used
          (unless use_ssl is False), but SSL certificates will not be verified.
        - ``path/to/cert/bundle.pem``: A filename of the CA cert bundle to use.
          You can specify this argument if you want to use a different
          CA cert bundle than the one used by botocore.
    :type verify: bool or str
    :param unload_options: reference to a list of UNLOAD options
    :type unload_options: list
    :param autocommit: If set to True it will automatically commit the UNLOAD statement.
        Otherwise it will be committed right before the redshift connection gets closed.
    :type autocommit: bool
    :param include_header: If set to True the s3 file contains the header columns.
    :type include_header: bool
    """

    ui_color = '#8EB6D4'

    @apply_defaults
    def __init__(self,
                 table,
                 s3_bucket,
                 s3_prefix,
                 select_query,
                 redshift_conn_id='redshift',
                 aws_conn_id='aws_credentials',
                 unload_options=tuple(),
                 autocommit=False,
                 include_header=False,
                 *args, **kwargs):
        super(FromRedshiftToS3TransferOperator, self).__init__(*args, **kwargs)
        self.table = table
        self.s3_bucket = s3_bucket
        self.s3_prefix = s3_prefix
        self.select_query = select_query
        self.redshift_conn_id = redshift_conn_id
        self.aws_conn_id = aws_conn_id
        self.unload_options = unload_options
        self.autocommit = autocommit
        self.include_header = include_header
        if self.include_header and 'HEADER' not in [uo.upper().strip() for uo in self.unload_options]:
            self.unload_options = list(self.unload_options) + ['HEADER', ]

    def execute(self, context):
        aws_hook = AwsHook("aws_credentials")
        credentials = aws_hook.get_credentials()
        redshift_hook = PostgresHook("redshift")
        self.log.info(f'Preparing to stage data from {self.select_query} to {self.s3_bucket}/{self.s3_prefix}...')
        unload_query = """
            UNLOAD {select_query}
            TO 's3://{s3_bucket}/{s3_prefix}/{table}_'
            with credentials
            'aws_access_key_id={access_key};aws_secret_access_key={secret_key}'
            {unload_options};
        """.format(select_query=self.select_query,
                   s3_bucket=self.s3_bucket,
                   s3_prefix=self.s3_prefix,
                   table=self.table,
                   access_key=credentials.access_key,
                   secret_key=credentials.secret_key,
                   unload_options='\n\t\t\t'.join(self.unload_options))
        self.log.info(f'{credentials.access_key}')
        self.log.info(f'{credentials.secret_key}')
        self.log.info('Executing UNLOAD command...')
        redshift_hook.run(unload_query, self.autocommit)
        self.log.info("UNLOAD command complete.")
And I get this error:
[2021-08-17 10:40:50,186] {{taskinstance.py:1150}} ERROR - Specified types or functions (one per INFO message) not supported on Redshift tables.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/airflow/dags/ownoperators/aws_from_redshift_to_s3_operator.py", line 95, in execute
redshift_hook.run(unload_query, self.autocommit)
File "/usr/local/lib/python3.7/site-packages/airflow/hooks/dbapi_hook.py", line 175, in run
cur.execute(s)
psycopg2.errors.FeatureNotSupported: Specified types or functions (one per INFO message) not supported on Redshift tables.
This error is generated by Redshift, and in most cases it occurs when your query uses a leader-node-only function (such as generate_series(); there are a number of these). Look at your select_query and check whether the functions it calls are valid on the compute nodes (run the query in a workbench). Happy to help if you post the query. The issue is the SQL, not the code you posted.
The root of the issue is that leader-node information is needed by the compute nodes during execution, and this route isn't supported. There can be several causes for this, and each has workarounds.
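For illustration, UNLOAD can only run queries that execute on the compute nodes, and catalog views such as information_schema.tables are generally resolved on the leader node. A hedged sketch of the kind of change this usually implies (public.my_table is a hypothetical table name, not something from the question):
# Hypothetical replacement: a plain SELECT from an ordinary user table runs on the compute nodes.
AWS_GET_DATA_FROM_REDSHIFT = """('SELECT * FROM public.my_table')"""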

Using input function with remote files in snakemake

I want to use a function to read input file paths from a dataframe and pass them to my snakemake rule. I also have a helper function that selects the remote from which to pull the files.
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
from snakemake.remote.SFTP import RemoteProvider as SFTPRemoteProvider
from os.path import join
import pandas as pd

configfile: "config.yaml"

units = pd.read_csv(config["units"]).set_index(["library", "unit"], drop=False)
TMP = join('data', 'tmp')

def access_remote(local_path):
    """Connects to the remote defined in the config file."""
    provider = config['provider']
    if provider == 'GS':
        GS = GSRemoteProvider()
        remote_path = GS.remote(join("gs://" + config['bucket'], local_path))
    elif provider == 'SFTP':
        SFTP = SFTPRemoteProvider(
            username=config['user'],
            private_key=config['ssh_key']
        )
        remote_path = SFTP.remote(
            config['host'] + ":22" + join(base_path, local_path)
        )
    else:
        remote_path = local_path
    return remote_path
def get_fastqs(wc):
    """
    Get fastq files (units) of a particular library - sample
    combination from the unit sheet.
    """
    fqs = units.loc[
        (units.library == wc.library) &
        (units.libtype == wc.libtype),
        "fq1"
    ]
    return {
        "r1": list(map(access_remote, fqs.fq1.values)),
    }
# Combine all fastq files from the same sample / library type combination
rule combine_units:
    input: unpack(get_fastqs)
    output:
        r1 = join(TMP, "reads", "{library}_{libtype}.end1.fq.gz")
    threads: 12
    run:
        shell("cat {i1} > {o1}".format(i1=input['r1'], o1=output['r1']))
My config file contains the bucket name and the provider, which are passed to the function. This works as expected when simply running snakemake.
However, I would like to use the Kubernetes integration, which requires passing the provider and bucket name on the command line. But when I run:
snakemake -n --kubernetes --default-remote-provider GS --default-remote-prefix bucket-name
I get this error:
ERROR :: MissingInputException in line 19 of Snakefile:
Missing input files for rule combine_units:
bucket-name/['bucket-name/lib1-unit1.end1.fastq.gz', 'bucket-name/lib1-unit2.end1.fastq.gz', 'bucket-name/lib1-unit3.end1.fastq.gz']
The bucket is applied twice: once mapped correctly to each element, and once before the whole list (which gets converted to a string). Did I miss something? Is there a good way to work around this?

Using private mibs with snmpv3

I am using the example below to send a v3 trap with a custom notification MIB:
http://pysnmp.sourceforge.net/examples/current/v3arch/agent/ntforg/trap-v3.html
But the sendNotification function accepts only the dotted OID format, as in the example.
What should I do to pass the MIB symbols instead?
E.g. ('MY-MIB:testTrap') instead of (1,3,2,0,...)
You are linking to the historical pysnmp documentation. Please refer to the current one.
To pass MIB symbols, you should use the ObjectIdentity class, which turns MIB names into OIDs behind the scenes.
SNMP notifications can be quite complicated, though, because they may imply that certain other MIB objects are automatically included with the notification being sent.
You can still add whatever MIB objects you want through NotificationType.addVarBinds.
Here's a simple example:
from pysnmp.hlapi import *

errorIndication, errorStatus, errorIndex, varBinds = next(
    sendNotification(
        SnmpEngine(),
        CommunityData('public'),
        UdpTransportTarget(('demo.snmplabs.com', 162)),
        ContextData(),
        'trap',
        NotificationType(
            ObjectIdentity('MY-MIB', 'testTrap')
        )
    )
)
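If you need to attach extra MIB objects to the notification, a minimal sketch of NotificationType.addVarBinds; MY-MIB::testObject and its value are hypothetical names used only for illustration:
from pysnmp.hlapi import *

# addVarBinds returns the NotificationType itself, so the calls can be chained.
notification = NotificationType(
    ObjectIdentity('MY-MIB', 'testTrap')
).addVarBinds(
    ObjectType(ObjectIdentity('MY-MIB', 'testObject'), OctetString('some value'))  # hypothetical object
)

errorIndication, errorStatus, errorIndex, varBinds = next(
    sendNotification(
        SnmpEngine(),
        CommunityData('public'),
        UdpTransportTarget(('demo.snmplabs.com', 162)),
        ContextData(),
        'trap',
        notification
    )
)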
"Hi again below is a snippet of the way I am using the code . "
from pysnmp.entity import engine, config
from pysnmp.carrier.asynsock.dgram import udp
from pysnmp.entity.rfc3413 import ntforg, context
from pysnmp.proto.api import v2c
from pysnmp.proto import rfc1902
from pysnmp.smi import builder
file = open("."+"/file.txt","r")
line = file.read()
fields = line.split(",")
file.close()
file = open("."+"/credentials.txt","r")
cred = file.read()
fields = cred.split(",")
file.close()
snmpEngine = engine.SnmpEngine(snmpEngineID=v2c.OctetString(hexValue=cred[0]))
config.addV3User(snmpEngine, cred[1], config.usmHMACSHAAuthProtocol, cred[2], config.usmAesCfb128Protocol, cred[3])
config.addTargetParams(snmpEngine, 'my-creds', 'traptest', 'authPriv')
config.addSocketTransport(snmpEngine, udp.domainName, udp.UdpSocketTransport().openClientMode())
config.addTargetAddr(snmpEngine, 'my-nms', udp.domainName, ('127.0.0.1', 162), 'my-creds', tagList='all-my-managers')
config.addNotificationTarget(snmpEngine, 'my-notification', 'my-filter', 'all-my-managers', 'trap')
config.addContext(snmpEngine, '')
config.addVacmUser(snmpEngine, 3, 'traptest', 'authPriv', (), (), (1,3,6))
snmpContext = context.SnmpContext(snmpEngine)
ntfOrg = ntforg.NotificationOriginator()
ntfOrg.snmpContext = snmpContext
ntfOrg.sendNotification(
snmpEngine,
'my-notification',
(1,3,6,1,4,1,46033,1,1,2,2),
(((1,3,6,1,4,1,46033,1,1,1,1), v2c.OctetString(fields[0])),)
)
print('Notification is scheduled to be sent')

How to use Spark BigQuery Connector locally?

For test purposes, I would like to use the BigQuery Connector to write Parquet Avro logs into BigQuery. As I write this, there is no way to ingest Parquet directly from the UI, so I'm writing a Spark job to do it.
In Scala, for the time being, the job body is the following:
val events: RDD[RichTrackEvent] =
  readParquetRDD[RichTrackEvent, RichTrackEvent](sc, googleCloudStorageUrl)

val conf = sc.hadoopConfiguration
conf.set("mapred.bq.project.id", "myproject")

// Output parameters
val projectId = conf.get("fs.gs.project.id")
val outputDatasetId = "logs"
val outputTableId = "test"
val outputTableSchema = LogSchema.schema

// Output configuration
BigQueryConfiguration.configureBigQueryOutput(
  conf, projectId, outputDatasetId, outputTableId, outputTableSchema
)
conf.set(
  "mapreduce.job.outputformat.class",
  classOf[BigQueryOutputFormat[_, _]].getName
)

events
  .mapPartitions {
    items =>
      val gson = new Gson()
      items.map(e => gson.fromJson(e.toString, classOf[JsonObject]))
  }
  .map(x => (null, x))
  .saveAsNewAPIHadoopDataset(conf)
As the BigQueryOutputFormat isn't finding the Google credentials, it falls back on the metadata host to try to discover them, with the following stacktrace:
2016-06-13 11:40:53 WARN HttpTransport:993 - exception thrown while executing request
java.net.UnknownHostException: metadata
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:160)
at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:207)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:72)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.createBigQueryCredential(BigQueryFactory.java:81)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQuery(BigQueryFactory.java:101)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQueryHelper(BigQueryFactory.java:89)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputCommitter.<init>(BigQueryOutputCommitter.java:70)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:102)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:84)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:30)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1135)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1078)
It is of course expected, but it should be able to use my service account and its key, since GoogleCredential.getApplicationDefault() returns the appropriate credentials fetched from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
As the connector seems to read credentials from the Hadoop configuration, what are the keys to set so that it reads GOOGLE_APPLICATION_CREDENTIALS? Is there a way to configure the output format to use a provided GoogleCredential object?
If I understand your question correctly, you might want to set:
<name>mapred.bq.auth.service.account.enable</name>
<name>mapred.bq.auth.service.account.email</name>
<name>mapred.bq.auth.service.account.keyfile</name>
<name>mapred.bq.project.id</name>
<name>mapred.bq.gcs.bucket</name>
Here, mapred.bq.auth.service.account.keyfile should point to the full file path of the older-style "P12" keyfile; alternatively, if you're using the newer "JSON" keyfiles, replace the "email" and "keyfile" entries with the single mapred.bq.auth.service.account.json.keyfile key:
<name>mapred.bq.auth.service.account.enable</name>
<name>mapred.bq.auth.service.account.json.keyfile</name>
<name>mapred.bq.project.id</name>
<name>mapred.bq.gcs.bucket</name>
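The question's job is written in Scala, but these keys can also be set directly on the Hadoop configuration from the driver. A minimal PySpark sketch, assuming the JSON-keyfile variant; the keyfile path, project ID and bucket below are placeholders, and sc._jsc is a private attribute commonly used to reach the underlying Hadoop configuration:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
hadoop_conf = sc._jsc.hadoopConfiguration()
# Placeholder values; substitute your own service account keyfile, project and bucket.
hadoop_conf.set("mapred.bq.auth.service.account.enable", "true")
hadoop_conf.set("mapred.bq.auth.service.account.json.keyfile", "/path/to/service-account.json")
hadoop_conf.set("mapred.bq.project.id", "my-project")
hadoop_conf.set("mapred.bq.gcs.bucket", "my-staging-bucket")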
You might also want to take a look at https://github.com/spotify/spark-bigquery, which is a much more civilised way of working with BQ and Spark. The setGcpJsonKeyFile method used in that case points to the same JSON file you'd set for mapred.bq.auth.service.account.json.keyfile when using the BQ connector for Hadoop.

Required tag missing error on MarketDataRequest query

I have a FIX44.MarketDataRequest query based on the TradeClient example from the QuickFIX/n example applications. But I can't manage to get anything useful from the server with this query; I always get the error Required tag missing.
Here is my code:
MDReqID mdReqID = new MDReqID("MARKETDATAID");
SubscriptionRequestType subType = new SubscriptionRequestType(SubscriptionRequestType.SNAPSHOT);
MarketDepth marketDepth = new MarketDepth(0);
QuickFix.FIX44.MarketDataRequest.NoMDEntryTypesGroup marketDataEntryGroup = new QuickFix.FIX44.MarketDataRequest.NoMDEntryTypesGroup();
marketDataEntryGroup.Set(new MDEntryType(MDEntryType.BID));
QuickFix.FIX44.MarketDataRequest.NoRelatedSymGroup symbolGroup = new QuickFix.FIX44.MarketDataRequest.NoRelatedSymGroup();
symbolGroup.Set(new Symbol("EUR/USD"));
QuickFix.FIX44.MarketDataRequest message = new QuickFix.FIX44.MarketDataRequest(mdReqID, subType, marketDepth);
message.AddGroup(marketDataEntryGroup);
message.AddGroup(symbolGroup);
Here is the generated outbound application level message (ToApp):
8=FIX.4.4|9=158|35=V|34=2|49=order.DEMOSUCD.123321|50=DEMOSUSD|52=20141223-07:02:33.226|56=demo.fxgrid|128=DEMOSUSD|262=MARKETDATAID|263=0|264=0|267=1|269=0|146=1|55=EUR/USD|10=232
Here is the received ToAdmin message:
8=FIX.4.4|9=149|35=3|34=2|49=demo.fxgrid|52=20141223-07:02:36.510|56=order.DEMOSUCD.123321|57=DEMOSUSD|115=DEMOSUSD|45=2|58=Required tag missing|371=265|372=V|373=1|10=136
If I understand correctly, the pair 371=265 (RefTagID=MDUpdateType) after 58=Required tag missing indicates which tag is missing, i.e. MDUpdateType is missing. But this is strange, because this tag is optional for MarketDataRequest.
UPDATE
Here is my FIX config file:
[DEFAULT]
FileStorePath=store
FileLogPath=log
ConnectionType=initiator
ReconnectInterval=60
CheckLatency=N
[SESSION]
BeginString=FIX.4.4
TargetCompID=demo.fxgrid
SenderCompID=XXX
SenderSubID=YYY
StartTime=00:00:00
EndTime=00:00:00
HeartBtInt=30
SocketConnectPort=ZZZ
SocketConnectHost=X.X.X.X
DataDictionary=FIX44.xml
ResetOnLogon=Y
ResetOnLogout=Y
ResetOnDisconnect=Y
Username=AAA
Password=BBB