Local dataset in federated learning: on the client side, is the local update performed on a different subset of the local dataset each round?

I was wondering whether, in a federated learning approach, I need to split the local dataset into a number of batches equal to the number of communication rounds, or whether I should perform the local update on the whole local dataset each round.

It depends on what you want to do. Federated learning is not a fixed method but a flexible approach that changes from one solution and architecture to another. I will try to make it clear by giving examples.
In the Google keyboard, for example, data is collected in real time, so in each round there is new data; in that case the whole local dataset is probably used for the update.
In another use case you may have a huge local dataset and it takes ages to retrain the model locally, so in that case you can train on a subset in each round to reduce the computational power and time needed to retrain the model.
Finally, federated learning still has a lot of challenges; use it only when it is really necessary, otherwise just adopt the normal centralized approach to train your model :)
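For instance, here is a minimal NumPy sketch of a client-side update that can either use the whole local dataset each round or a different round-specific slice of it. The function names and the simple averaging server are illustrative assumptions, not part of any particular federated learning framework.

```python
# Minimal sketch of one federated round on the client side (illustrative only,
# not tied to any specific FL framework). A linear model is trained with plain
# NumPy SGD; `use_round_subset` switches between "whole local dataset each
# round" and "a different slice of the local dataset each round".
import numpy as np

def client_update(weights, X, y, round_idx, num_rounds,
                  use_round_subset=False, lr=0.01, epochs=1):
    w = weights.copy()
    if use_round_subset:
        # Split the local dataset into `num_rounds` slices and pick the one
        # belonging to this round (reduces per-round compute).
        idx = np.array_split(np.arange(len(X)), num_rounds)[round_idx]
        X, y = X[idx], y[idx]
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(X)   # MSE gradient
        w -= lr * grad
    return w

def server_aggregate(client_weights):
    # FedAvg-style simple average (unweighted, for brevity).
    return np.mean(client_weights, axis=0)

# Toy usage: 3 clients, 5 communication rounds.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(60, 4)), rng.normal(size=60)) for _ in range(3)]
global_w = np.zeros(4)
for r in range(5):
    updates = [client_update(global_w, X, y, r, 5, use_round_subset=True)
               for X, y in clients]
    global_w = server_aggregate(updates)
```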

Related

Should I store every data point or only changes in offline store for Feast offline feature retrieval?

I am implementing a Feature Engineering & Feature store solution with Feast on GCP.
I am using BigQuery for offline storage.
I have a question: say I have a feature on a user entity that does not change frequently (for example, address). I of course intend to use Feast to build training datasets with the point-in-time join functionality. In that case I seem to have two options:
Saving the address for all my users in the BigQuery table at a given frequency (let's say every hour), even if there is no change in the feature value compared to the previous one stored, which produces a lot of duplicates.
Saving only changes in the feature, which leaves potentially large gaps and sparsity in the storage.
The second option seems the most adequate, since we would not store many duplicate data points. However, I know there is a ttl argument on the Feast FeatureView object, which as I understand it sets the number of days Feast will search back for feature values when using get_historical_features. Thus, for data with large sparsity such as a user address, I may need to set a very high ttl value, which may have performance and cost impacts according to the Feast documentation.
What is the right way to approach this problem?
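For reference, a hedged sketch of how the second option interacts with ttl, assuming a recent Feast API (exact class and field names vary between Feast versions, so treat this as illustrative rather than definitive). The table and field names are made up; the key point is that ttl must cover the longest gap between stored changes for point-in-time joins to find a value for every user.

```python
# Illustrative Feast definitions (assumed recent Feast API; adjust to the
# installed version). The long ttl lets point-in-time joins look far enough
# back to find the last stored address change for each user.
from datetime import timedelta

from feast import BigQuerySource, Entity, FeatureView, Field
from feast.types import String

user = Entity(name="user", join_keys=["user_id"])

address_source = BigQuerySource(
    table="my_project.my_dataset.user_address_changes",  # hypothetical table
    timestamp_field="event_timestamp",
)

user_address_view = FeatureView(
    name="user_address",
    entities=[user],
    ttl=timedelta(days=365),  # must cover the longest gap between changes
    schema=[Field(name="address", dtype=String)],
    source=address_source,
)
```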

Is this scenario a big data project?

I'm involved in a project with two phases and I'm wondering if this is a big data project (I'm a newbie in this field).
In the first phase I have this scenario:
I have to collect a huge amount of data
I need to store it
I need to build a web application that shows the data to users
In the second phase I need to analyze the stored data, build reports, and do some analysis on it.
As an example of the data volume: in one day I may need to collect and store around 86,400,000 records (roughly 1,000 records per second).
Now I was thinking of this kind of architecture:
to collect data, some asynchronous technology like ActiveMQ and the MQTT protocol
to store data, a NoSQL DB (MongoDB, HBase, or another)
This would solve my first-phase problems.
But what about the second phase?
I was thinking about some big data software (like Hadoop or Spark) and some machine learning software, so I can retrieve data from the DB, analyze it, and store it in a better form in order to build good reports and do some specific analysis.
I was wondering if this is the best approach.
How would you solve this kind of scenario? Am I on the right track?
Thank you,
Angelo
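As a rough illustration of the ingestion path described in the question, here is a minimal sketch using the paho-mqtt client and pymongo. The broker address, topic name, batch size, and database names are assumptions for the example, not a recommended production setup.

```python
# Minimal sketch of MQTT -> MongoDB ingestion (illustrative only). Assumes a
# local MQTT broker and MongoDB instance; topic, batch size, and collection
# names are made up for the example.
import json

import paho.mqtt.client as mqtt
from pymongo import MongoClient

BATCH_SIZE = 1000  # ~1 second of traffic at 1,000 records/second
buffer = []

mongo = MongoClient("mongodb://localhost:27017")
collection = mongo["telemetry"]["raw_events"]

def on_message(client, userdata, msg):
    # Each MQTT payload is assumed to be a JSON-encoded record.
    buffer.append(json.loads(msg.payload))
    if len(buffer) >= BATCH_SIZE:
        collection.insert_many(buffer)   # bulk insert to keep up with the rate
        buffer.clear()

client = mqtt.Client()  # paho-mqtt 1.x style; with paho-mqtt 2.x pass
                        # mqtt.CallbackAPIVersion.VERSION1 as the first argument
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe("sensors/#")
client.loop_forever()
```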
As siddhartha answered, whether your project can be tagged as a big data project or not depends on the context and the business domain/case of your project.
Coming to the tech stack, each of the technologies you mentioned has a specific purpose. For example, if you have structured data, you can use any modern database with query support. NoSQL databases come in different flavours (columnar, document-based, key-value, etc.), so the technology choice again depends on the kind of data and use case you have. I suggest you do some POCs and analysis of the technologies before making final calls.
The definition of big data varies from user to user. For Google, 100 TB might be small data, but for me it is big data because of the difference in available commodity hardware. For example, Google can have a cluster of 50,000 nodes, each with 64 GB of RAM, for analysing 100 TB of data, so for them this is not big data. But I cannot have a cluster of 50,000 nodes, so for me it is big data.
The same applies to your case: if you have commodity hardware available, you can go ahead with Hadoop. Since you have not mentioned the size of the files you are generating each day, I cannot be certain about your case. But Hadoop is always a good choice for processing your data, especially with newer projects like Spark, which can help you process data in much less time and also gives you real-time analysis features. So in my opinion it is better if you can use Spark or Hadoop, because then you can play with your data. Moreover, since you want to use a NoSQL database, you can use HBase, which is available with Hadoop, to store your data.
Hope this answers your question.
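For the second phase, here is a minimal PySpark sketch of the kind of batch report the answers have in mind. It assumes the raw events have been landed as newline-delimited JSON on HDFS; the path and field names are illustrative assumptions, and the same idea applies if the data is read from HBase or MongoDB via a connector.

```python
# Minimal PySpark batch report for "phase 2" (illustrative). Assumes the raw
# events are available as newline-delimited JSON on HDFS; the field names
# `device_id` and `ts` are made up for the example.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-report").getOrCreate()

events = spark.read.json("hdfs:///data/raw_events/")  # ~86M records/day

daily = (
    events
    .withColumn("day", F.to_date(F.col("ts")))
    .groupBy("day", "device_id")
    .agg(F.count(F.lit(1)).alias("events"))
)

# Write the aggregate somewhere a reporting/BI tool can read it.
daily.write.mode("overwrite").parquet("hdfs:///data/reports/daily_counts/")
```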

Complex queries vs storing more data

I'm having a hard time deciding which is the better approach for what I want to have in my application. I mostly use MongoDB and work on web applications, if that helps make the answer more specific.
I wonder which will be the better approach:
Store as little data as possible in Mongo's collections, and implement my different features mainly with server-side logic and calculations against the stored data.
Store anything I need for a feature, to avoid complex queries and logic in server-side functions, while filling my Mongo with lots of information.
Because of my lack of experience (I'm only starting out as a web developer), I can't figure out which approach I should take. I thought about taking the second one while prototyping, but switching once I start to scale might be too much effort. Any suggestions?
I think you're missing a 3rd option:
Store little data in the collection, do calculations on the client.
The beauty of Meteor comes with the tech shift to thick clients. Back in the day, primarily prior to V8, browsers couldn't do much. Now they can do a crazy amount of work. By pushing work to your clients, you take processing load off your server, you don't have to save calculated fields to your DB, and oftentimes client processing time < server processing time + transmission time.
If you can't do option #3, lean towards #2. Storage is generally cheaper (and definitely faster) than processing power, but it depends on the complexity of the query. If storage cost < processing cost, store it.
array.toString()? Calculate it. A neural network simulation? Store it.
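To make the trade-off concrete, here is a minimal sketch (Python with pymongo rather than Meteor's JavaScript, purely for illustration) of option #2 versus option #1: storing a precomputed field at write time versus computing it on read. The collection and field names are assumptions.

```python
# Illustrative only: the same "score" can be stored at write time (option 2)
# or derived at read time (option 1). Collection/field names are made up.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["app"]

def save_result_stored(user_id, answers):
    # Option 2: pay a little extra storage, keep reads trivial.
    db.results.insert_one({
        "user_id": user_id,
        "answers": answers,
        "score": sum(answers),          # precomputed at write time
    })

def get_score_stored(user_id):
    doc = db.results.find_one({"user_id": user_id})
    return doc["score"]

def get_score_computed(user_id):
    # Option 1: store only the raw data and compute on demand
    # (here on the server; in Meteor the same logic could run on the client).
    doc = db.results.find_one({"user_id": user_id})
    return sum(doc["answers"])
```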

Enterprise integration via a data warehouse, or via messages?

Imagine a large organisation with many applications. The applications are not currently integrated to any great extent. There is a new and empty enterprise data warehouse, and it would store all data in a canonical format. The first step is to set up the warehouse and seed it with data from the applications.
I am looking for pros and cons between the following two enterprise integration patterns:
1) Using a combination of integration tools, set up batch jobs to extract, transform, and load data into the warehouse at a periodic interval. Then, as part of the process, integrate the data from the warehouse into the required applications.
2) Using a combination of integration tools, detect changes real-time, or in batch and publish them to a service bus (in canonical format). Then, for each required application, subscribe to the messages to integrate them. The data warehouse is another subscriber to the same messages.
Thanks in advance.
One aspect that is hard to get right with integration-via-messages is periodic datasets.
Say you have a table in your data warehouse (DW) that contains data partitioned by day. If an ETL job loads that table, you can be sure that if the load job is finished, the respective dataset is complete (unless there's a bug in the job).
Messaging systems, on the other hand, usually don't provide guarantees of timely delivery. So you might get 90% of messages for a particular day by midnight, 8% within the next hour, and the remaining 2% within the next 6 hours (and a few messages might never arrive). In this situation, if you have a job that depends on this data, how can you know that the dataset is ready? You can set an arbitrary cutoff time (e.g. 1 hour past midnight) based on previous experience, SLAs, or some other criteria, when you consider the dataset complete, but that will by design be an approximation. You will also need some means to detect missing data (because of lost messages) and re-request it from the source.
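As a rough illustration of that cutoff idea (not from the original answer), here is a small Python sketch of a readiness check that declares a daily partition complete either when an expected record count has arrived or when the cutoff time has passed. The one-hour cutoff and the notion of an expected count supplied by the source system are assumptions.

```python
# Illustrative readiness check for a message-fed daily partition. The cutoff
# (1 hour past midnight, UTC) and the "expected count" from the source system
# are assumptions made for the example.
from datetime import datetime, time, timedelta, timezone

CUTOFF = time(hour=1)  # 01:00 on the following day

def dataset_ready(partition_date, received_count, expected_count=None,
                  now=None):
    now = now or datetime.now(timezone.utc)
    if expected_count is not None and received_count >= expected_count:
        return True   # the source told us how many records to expect
    cutoff_dt = datetime.combine(partition_date + timedelta(days=1), CUTOFF,
                                 tzinfo=timezone.utc)
    return now >= cutoff_dt  # otherwise fall back to the arbitrary cutoff
```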
This answer talks about similar problems.
Another issue is backfills. Imagine your source sends a backdated message, for example to correct some previously-sent one that belongs to a dataset in the past. Presumably, any consumers of that dataset need to be notified of the change and recompute their results. However, without some additional logic in the DW they might not know about it. With the ETL approach, since you already have dependencies between jobs, if you rerun some job with a backfill date, its dependencies will run automatically, or at least it'll be explicitly known that some consumers are affected.
With these caveats in mind, the messaging approach has some great advantages:
all your systems will be integrated using a uniform approach
the propagation time for your data will potentially be much lower
you won't have to fix ETL jobs that exploded because the data volume has grown past their ability to scale
you won't get SLA violations because your ETL jobs timed out
I guess you are talking about both ETL systems and the Mediation (intra-communication) design pattern. I don't see why you have to choose between them; in my current project we combine them.
The ETL solution is implemented as a layer responsible for managing the data integration (via an orchestrator module). It is a single entry point and part of the Pipes and Filters design pattern that we rely on. It's able to perform a variety of tasks of varying complexity on the information that it processes.
On the other hand, Mediation as an EAI system acts as a "broker" between multiple applications. Whenever an interesting event occurs in an application (for instance, new information is created or a transaction is completed), an integration module in the EAI system is notified. The module then propagates the change to the other relevant applications.
So, as a bottom line, I can't give you pros & cons for the two, since to me they are a good solution together and their use depends on your goals, design, etc. But from your description it seems to me that what you're considering is similar to what I've suggested.
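As a loose sketch of that mediation idea (not tied to any specific EAI product), here is a tiny in-process publish/subscribe example in Python where the warehouse loader is just one more subscriber to the canonical events; all names are illustrative.

```python
# Tiny in-process pub/sub sketch of the mediation pattern described above.
# In a real EAI setup this would be a service bus / message broker rather
# than a dict of callbacks; everything here is illustrative.
from collections import defaultdict

subscribers = defaultdict(list)

def subscribe(event_type, handler):
    subscribers[event_type].append(handler)

def publish(event_type, canonical_event):
    # The integration module notifies every interested application,
    # including the data warehouse, with the same canonical message.
    for handler in subscribers[event_type]:
        handler(canonical_event)

# Example subscribers: a downstream application and the warehouse loader.
subscribe("customer.updated", lambda e: print("CRM sync:", e))
subscribe("customer.updated", lambda e: print("DW load:", e))

publish("customer.updated", {"customer_id": 42, "address": "1 Main St"})
```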

What are the real-time compute solutions that can take raw semistructured data as input?

Are there any technologies that can take raw semi-structured, schema-less big data input (say from HDFS or S3), perform near-real-time computation on it, and generate output that can be queried or plugged in to BI tools?
If not, is anyone at least working on it for release in the next year or two?
There are some solutions that take big semi-structured input and produce queryable output, but they are usually
unique
expensive
fairly secretive
If you can avoid direct computation with neural networks or expert systems, you will be close enough to a low-latency system. All you need is a team of brilliant mathematicians to model your problem, a team of programmers to realize it in code, and some cash to buy servers and the input/output channels they need.
Have you taken a look at Splunk? We use it to analyze Windows Event Logs and Splunk does an excellent job indexing this information to allow for fast querying of any string that appears in the data.