Has anyone tried to look up external APIs using PySpark?

Checking to see if anyone has used PySpark to distribute data and then used it to look up an external API and store the results.
I am working on this problem:
I have a source file with, say, 100,000 rows of user agents. I have to look up an external API (using requests) to get the characteristics of each user agent and store the results. I was able to accomplish this with queues and threads in a reasonable manner.
However, I noticed that the 100k row count might grow to a million.
I was wondering whether I could use Spark to distribute this data and perform the API lookups in parallel across executors.
Has anyone accomplished this?
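For reference, this is roughly the shape of what I have in mind in PySpark: distribute the user agents and call the API from each partition with mapPartitions, so each executor reuses one HTTP session. The endpoint URL and response handling below are placeholders, not a real service.
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ua-lookup").getOrCreate()

# One user agent per line; adjust the reader to the real source format.
user_agents = spark.read.text("user_agents.txt").rdd.map(lambda row: row[0])

def lookup_partition(rows):
    session = requests.Session()  # one HTTP session per partition
    for ua in rows:
        # Placeholder endpoint; replace with the real lookup service.
        resp = session.get("https://example.com/ua-lookup", params={"ua": ua}, timeout=10)
        yield (ua, resp.text if resp.ok else None)

results = user_agents.repartition(32).mapPartitions(lookup_partition)
results.toDF(["user_agent", "characteristics"]).write.mode("overwrite").json("ua_results")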

Related

Is it possible to generate DataFrame rows from the context of a Spark Worker?

The fundamental problem is attempting to use Spark to generate data and then work with that data internally. I.e., I have a program that does some work and generates "rows" of data - can I leverage Spark to parallelize that work across the worker nodes and have each of them contribute back to the underlying store?
The reason I want to use Spark is that it seems to be a very popular framework, and I know this request is a little outside the defined range of functions Spark should offer. However, the alternatives of MapReduce or Storm are dreadfully old and there isn't much support anymore.
I have a feeling there has to be a way to do this, has anyone tried to utilize Spark in this way?
To be honest, I don't think adopting Spark just because it's popular is the right decision. Also, it's not obvious from the question why this problem would require a framework for distributed data processing (that comes along with a significant coordination overhead).
The key consideration should be how you are going to process the generated data in the next step. If it's all about dumping it immediately into a data store I would really discourage using Spark, especially if you don't have the necessary infrastructure (Spark cluster) at hand.
Instead, write a simple program that generates the data. Then run it on a modern resource scheduler such as Kubernetes and scale it out and run as many instances of it as necessary.
If you absolutely want to use Spark for this (and unnecessarily burn resources), it's not difficult. Create a distributed "seed" dataset / stream and simply flatMap that. Using flatMap you can generate as many new rows for each seed input row as you like (obviously limited by the available memory).
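For illustration, a minimal PySpark sketch of that seed-and-flatMap idea; generate_rows here is only a stand-in for whatever program actually produces the rows:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("row-generator").getOrCreate()

def generate_rows(seed):
    # Stand-in generator: emit as many rows per seed as needed.
    for i in range(1000):
        yield (seed, i, "value-%d-%d" % (seed, i))

# Small "seed" dataset, one element per unit of work, spread over the executors.
seeds = spark.sparkContext.parallelize(range(100), numSlices=100)

rows = seeds.flatMap(generate_rows).toDF(["seed", "row_id", "payload"])
rows.write.mode("overwrite").parquet("generated_rows")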

How to overcome API/websocket limitations with OHLC data for a trading platform with lots of users?

I'm using CCXT for some REST API calls and websockets. It's OK for one user, but if I wanted many users on the platform, how would I go about an in-house solution?
Currently each chart uses either websockets or REST calls; if I have 20 charts, that's 20 calls, and if I add users, that's 20x however many users. If I get a complete coin list with real-time prices from one exchange, that just slows everything down.
Some ideas I have thought about so far are:
Use proxies with REST/Websockets
Use TimescaleDB to store the data and serve that, or
Use caching on the server, and serve that to the users
Would this be a solution? There's got to be a way to overcome rate limiting and reduce the number of calls to the exchanges.
It's probably a good idea to think about separate layers to:
receive market data (a single connection that broadcasts data to the OHLC processors)
process OHLC histograms (subscribing to the internal market data)
serve histogram data (subscribing to the processed data)
The market data stream is huge, and if you design these layers independently it becomes easier to scale them and even decouple the components later if necessary.
With TimescaleDB, you can build materialized views that make the information easy to access and retrieve. Each materialized view can have a continuous aggregate policy based on the interval of the histograms.
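As a rough sketch of that continuous-aggregate idea (the trades table, its columns, and the connection string are assumptions), a 1-minute OHLC view in TimescaleDB driven from Python:
import psycopg2

conn = psycopg2.connect("dbname=market")  # hypothetical connection string
conn.autocommit = True  # avoid wrapping the DDL in an explicit transaction
cur = conn.cursor()

# 1-minute OHLC candles continuously aggregated from a raw "trades" hypertable.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS ohlc_1m
    WITH (timescaledb.continuous) AS
    SELECT time_bucket('1 minute', time) AS bucket,
           symbol,
           first(price, time) AS open,
           max(price)         AS high,
           min(price)         AS low,
           last(price, time)  AS close
    FROM trades
    GROUP BY bucket, symbol;
""")

# Keep the view refreshed as new trades arrive.
cur.execute("""
    SELECT add_continuous_aggregate_policy('ohlc_1m',
        start_offset      => INTERVAL '1 hour',
        end_offset        => INTERVAL '1 minute',
        schedule_interval => INTERVAL '1 minute');
""")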
Fetching all data all the time for all the users is not a good idea.
Pagination can help by bringing in the visible histograms first and limiting the query results, avoiding heavy I/O and large chunks of memory on the server.

Beam I/O connector (python) for SOAP and/or REST

We are building a data warehouse that combines data from many (n) sources. These sources are made available to us in various ways: CSV files, direct database access, SOAP, and REST.
CSV files and direct access are covered extensively in the Apache Beam documentation; however, there seems to be little (if any) coverage of SOAP and REST. We have no problem fetching the REST and SOAP data first, then using it to instantiate a PCollection and letting the pipeline handle it from there. However, a pattern is emerging that we think calls for a different approach:
Query an endpoint to get a list of other endpoints that serve the actual data.
Iterate over said list to retrieve the data
In some cases, the total amount of records retrieved like this will be in the tens of millions (possibly more in the future).
The question: How can we retrieve this data efficiently using Beam, making use of parallel processing? Do we need to write a custom Python I/O connector, and if so, where can we find an example for SOAP or REST (or really any HTTP request)? We searched extensively, but all we found was a single link, and even that was stale.
Or alternatively: The lack of documentation and questions about this subject makes us think that Beam is in fact not the correct tool for this particular job. Is this correct?
Not sure if it's entirely relevant: we use Google Dataflow as a runner and BigQuery for storage (for now).
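In case it helps frame the question, this is roughly how we imagine the pattern mapping onto the Beam Python SDK (the URLs and response parsing are placeholders): one FlatMap expands the listing endpoint into the data endpoints, a Reshuffle breaks fusion so the fetches can be spread across workers, and a second FlatMap does the actual requests.
import requests
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

LISTING_URL = "https://example.com/api/endpoints"  # placeholder

def list_endpoints(url):
    # Step 1: a single request returning the list of data endpoints.
    return requests.get(url, timeout=30).json()["endpoints"]

def fetch_records(endpoint):
    # Step 2: fetch the actual records from one endpoint.
    for record in requests.get(endpoint, timeout=30).json()["records"]:
        yield record

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "Seed" >> beam.Create([LISTING_URL])
     | "ListEndpoints" >> beam.FlatMap(list_endpoints)
     | "Reshuffle" >> beam.Reshuffle()  # redistribute so fetches run in parallel
     | "FetchRecords" >> beam.FlatMap(fetch_records)
     | "Write" >> beam.io.WriteToText("records"))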

Spark Scala - processing different child DataFrames in parallel in bulk

I am working on a fraudulent transaction detection project that uses Spark and primarily a rule-based approach to risk-score incoming transactions. For this rule-based approach, several maps are created from the historical data to represent the various patterns in transactions, and these are then used later while scoring a transaction. Due to a rapid increase in data size, we are now modifying the code to generate these maps at the level of each account.
The earlier code was, for example:
createProfile(clientData)
but now it becomes
accountList.map(account=>createProfile(clientData.filter(s"""account=${account}""")))
Using this approach the profiles are generated, but since these operations happen sequentially, it does not seem feasible.
Also, the createProfile function itself uses DataFrames and the SparkSession/SparkContext, which leads to the issue of not being able to send these tasks to worker nodes, since (to my understanding) only the driver can access DataFrames and the SparkSession/SparkContext. Hence, the following code does not work:
import sparkSession.implicits._
val accountListRdd = accountList.toSeq.toDF("accountNumber")
accountListRdd.rdd.map(accountRow => createProfile(clientData.filter(s"""account=${accountRow.get(0).toString}""")))
The above code is not working but represents the logic for the desired output behaviour.
Another approach I am looking at is multithreading at the driver level using Scala Futures. But even in this scenario, many JVM objects are created in a single createProfile call, so increasing the number of threads, even if the approach works, can lead to a lot of JVM objects, which itself can cause garbage collection and memory overhead issues.
Just to put the timing in perspective: createProfile takes about 10 minutes on average for a single account, and we have 3000 accounts, so sequentially it would take roughly 500 hours (about three weeks). Even with a 10x speedup from multithreading it would still take a couple of days, so we need parallelism on the order of hundreds.
One thing that could have worked, if it existed, is something like a Spark groupBy within a groupBy, where at the first level we could group by "account" and then perform the other operations
(currently the issue is that a UDF cannot handle the kind of operations we want to perform).
Another possibility, if practical, is something like the way Spark Streaming works:
it has a foreachRDD method and also a spark.streaming.concurrentJobs parameter which allows multiple RDDs to be processed in parallel. I am not sure how it works, but maybe that kind of implementation could help.
Above is the problem description and my current views on it.
Please let me know if anyone has any ideas about this! Also, I would prefer a logical change rather than a suggestion of a different technology.

Kafka Interactive Queries - Accessing large data across instances

We are planning to run a Kafka Streams application distributed across two machines. Each instance stores its KTable data on its own machine.
The challenge we face here is:
We have a million records pushed to the KTable. We need to iterate over the whole KTable (RocksDB) data and generate the report.
Let's say 500K records are stored in each instance. It is not possible to get all the records from the other instance in a single GET over HTTP (unless some streaming technique is available). Basically, we need the data from both instances in a single call to generate the report.
Proposed Solution:
We are thinking of having a shared location (state.dir) for these two instances, so that both instances store their KTable data in the same directory. The idea is then to get all the data from a single instance, without interactive queries, by just calling:
final ReadOnlyKeyValueStore<Key, Result> allDataFromTwoInstances =
    streams.store("result", QueryableStoreTypes.<Key, Result>keyValueStore());
KeyValueIterator<Key, Result> iterator = allDataFromTwoInstances.all();
while (iterator.hasNext()) {
    KeyValue<Key, Result> entry = iterator.next();
    // append entry to the Excel report
}
iterator.close();
Question:
Will the above solution work without any issues? If not, is there any alternative solution for this?
Please suggest. Thanks in advance.
GlobalKTable is the most natural first choice, but it means each node where the global table is defined contains the entire dataset.
The other alternative that comes to mind is indeed to stream the data between the nodes on demand. This makes sense especially if creating the report is an infrequent operation or if the dataset cannot fit on a single node. Basically, you can follow the documentation guidelines for querying remote Kafka Streams nodes here:
http://kafka.apache.org/0110/documentation/streams/developer-guide#streams_developer-guide_interactive-queries_discovery
and for RPC use a framework that supports streaming, e.g. akka-http.
Server-side streaming:
http://doc.akka.io/docs/akka-http/current/java/http/routing-dsl/source-streaming-support.html
Consuming a streaming response:
http://doc.akka.io/docs/akka-http/current/java/http/implications-of-streaming-http-entity.html#client-side-handling-of-streaming-http-entities
This will not work. Even if you have a shared state.dir, each instance only loads its own share/shard of the data and is not aware of the other instance's data.
I think you should use GlobalKTable to get a full local copy of the data.