How can I use Scrapy to build a distributed scraper with Celery?

I want to build a distributed scraper with Scrapy and Celery. My current idea is to use a master-slave approach. Can someone tell me whether that is a good idea? Is there a good open-source project for this?

When I implemented a distributed crawling set-up, I achieved it with the help of Redis. Here is how I did it.
I had a list of domains to be crawled; in my project there were 30K domains to scrape data from. I uploaded those domains to Redis.
Each worker then uses the redis-py client to talk to Redis and feeds each URL to Scrapy.
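Below is a minimal sketch of that approach. The Redis key name, connection details, and file of domains are assumptions about your setup, not a drop-in implementation:

    import redis
    import scrapy

    # One-off seeding script, run on the master: push every domain into a
    # shared Redis list that all crawl workers will consume from.
    def seed_domains(path="domains.txt"):
        r = redis.Redis(host="localhost", port=6379, db=0)
        with open(path) as f:
            for domain in f:
                r.lpush("crawl:domains", domain.strip())

    class DomainSpider(scrapy.Spider):
        """Run one instance of this spider on each worker machine."""
        name = "domains"

        def start_requests(self):
            # Every worker pops from the same list, so the 30K domains are
            # split across however many workers you start.
            r = redis.Redis(host="localhost", port=6379, db=0)
            while True:
                domain = r.lpop("crawl:domains")
                if domain is None:
                    break  # list exhausted; this worker is done
                yield scrapy.Request(f"http://{domain.decode()}")

        def parse(self, response):
            # Extract whatever data you need here.
            yield {"url": response.url, "title": response.css("title::text").get()}

As for an open-source project: scrapy-redis implements essentially this pattern (Redis-backed scheduling and request deduplication for Scrapy), so it is worth a look before rolling your own.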

Related

How to create a form with a URL input that redirects to the PageSpeed Insights score or displays it with AJAX

Is it possible to do this? Ideally, the report would be returned on the very same page with AJAX.
For example, the user enters www.mywebsite.com in the field and the PageSpeed report is returned. If that is not possible, then redirect to the PageSpeed results page.
You have a few options here, starting from the easiest to the hardest (and, in my opinion, from "worst" to "best").
Add the Page Speed Insights (PSI) test page to an iframe on your site. You can then change the URL of that iframe to https://developers.google.com/speed/pagespeed/insights/?url=yourwebsite.com and manipulate the ?url=yourwebsite.com to be whatever you want.
This may be against Google's terms of service and is also a bad user experience but it is the easiest way to achieve it. I will leave you to investigate that option if you decide to do it.
Redirect users to a new tab. So just do <a target="_blank" href="https://developers.google.com/speed/pagespeed/insights/?url=yourwebsite.com">view your report</a> or redirect via JS on a button click.
Yet again not a great option as people are leaving your site but at least this won't be against Google's terms of service.
Use the PageSpeed Insights API: https://developers.google.com/speed/docs/insights/v5/get-started.
This is your best option in terms of time vs flexibility. You supply the API with the URL and it returns a JSON response with all of the metrics it gathers and the scoring.
Please note PSI is on version 6 of the API which should be available for general use soon.
Obviously this is a lot more work but well worth the effort as you can style everything as you please.
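As a rough illustration, here is what a server-side call to the v5 API might look like in Python (the endpoint and response keys below are from the v5 docs; error handling is kept minimal):

    import requests

    PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

    def psi_report(url, api_key=None, strategy="mobile"):
        """Fetch the PageSpeed Insights report for a URL as JSON."""
        params = {"url": url, "strategy": strategy}
        if api_key:
            params["key"] = api_key  # only needed beyond the anonymous quota
        response = requests.get(PSI_ENDPOINT, params=params, timeout=60)
        response.raise_for_status()
        return response.json()

    report = psi_report("https://www.example.com")
    # The overall performance score (0 to 1) sits in the Lighthouse section.
    print(report["lighthouseResult"]["categories"]["performance"]["score"])

From there you can pull out whichever metrics you want and render them on your own page, styled however you like.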
Install Lighthouse, the engine that drives PSI, on your own server.
You can find the Lighthouse repository here. Please note you need to know how to use Node, it is useful to understand Puppeteer, and you need a reasonable amount of server-admin knowledge to get Chromium (used as a headless web browser for running the tests) working and linked correctly.
At this stage you have complete control and can write your own tests, scoring criteria, etc. You can also run as many tests as your server will allow. If you want this level of control and freedom, then this is the best option. However, be prepared to sink a lot of hours into this solution!

Fully scalable website with micro-applications

I'm in the process of designing a cloud-deployed website for a new solution my company is looking to provide. I have been attempting to answer a few questions and haven't had any luck, so, when in Rome.
First, I don't want the website to be tied to any one particular framework. I know there is no way to completely future-proof a website, but I would rather not put all of our eggs in one basket.
Secondly, I want an entire separation between the front end and the back end. I have a list of reasons why I'm looking to do this and don't necessarily want to get into what they are. Server-side rendering is, for the most part, out of the question.
So where does that leave me?
My initial thoughts on the design are to have a REST API that can be accessed for any API calls (this may be switched to GraphQL in the future).
The design decisions I'm mostly wrestling with are for the front end. The website will be a dashboard-type system, where tenants can log in and see the screens that belong to them.
I was thinking that I would have a sort of shell that hooks on to the index.html. This would have its own routing and would render micro-applications that are completely separate from the shell logic.
So, for example, if I load index.html, the path being "/", the shell has some routes that it's responsible for, let's say:
"/todos"
"/account"
If I accessed the /todos route, my shell application would then render that micro-app. This application would be completely separate from the shell, except for some data that might be shared via the window once the application is rendered by the shell.
So my todos route, for example, could be an independent Redux application. It could have its own routing, etc.
Is this is a common architecture? Are there any examples of this? Is there a better way of going about this?
Thanks for any insight!
Sounds like you're well and truly over-engineering this beast.
You might take on such an architecture for a HUGE build with many dev teams all working separately. For a small agile team, the above would create a lot of overhead in boilerplate and brain ache from context switching between each "app".
Micro-service architecture is seriously great. Just don't break it up too small; read your use case well and break your services up accordingly.
For example: we are a team of 3. We have a pretty large-ish app divided into:
PHP API
Backend management interface (Redux)
Frontend website (HTML, React, PHP)
Search service (Elasticsearch)
Cache (Redis)
Data store (MySQL)
All running in multiple Docker containers across multiple hosts. Pull down the backend? Fine, the frontend website is still up and running!
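To make that concrete, here is a hypothetical docker-compose sketch of a similar split; the image names and ports are placeholders, not the actual setup described above:

    services:
      api:
        image: my-php-api            # PHP API
        depends_on: [mysql, redis, elasticsearch]
      backend:
        image: my-admin-ui           # backend management interface (Redux)
      frontend:
        image: my-frontend           # public website (HTML, React, PHP)
        ports: ["80:80"]
      elasticsearch:
        image: elasticsearch:7.17.0  # search service
      redis:
        image: redis:7               # cache
      mysql:
        image: mysql:8               # data store
        environment:
          MYSQL_ROOT_PASSWORD: example

Because each piece is its own container, one service failing (or being redeployed) doesn't take the rest down.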

How to host audio files for a web application

I'm planning to make a web application which allows users to upload and host music/audio files. I'm wondering what the best method would be to go about this; I have used Cloudinary in previous projects for image hosting, but nothing for audio.
What do companies like SoundCloud use, if not their own service, which I am assuming is the case?
What would you recommend? This will be vital when it comes to building a scalable and reliable service, so I don't want to go into this project uneducated.
P.S. I will be using Meteor and MongoDB to build the application.
I'd recommend getting started with edgee:slingshot in your app. It's much lighter on your Meteor server since uploads and downloads go straight to the storage system. There you have several choices including S3, Google Cloud Storage, and Rackspace Cloud. You could also use CollectionFS but slingshot seems architecturally better suited to this class of problem.

Best scalable model for a website serving millions of users every day

I want to develop a website that will serve millions of pages every day, including to mobile devices. The site will have strong social features and thus will require lots of reads/writes. It will also suggest things to users based on their social behaviors (likes, dislikes, etc.) and their friends' behaviors. After considering many elements, I have come up with:
A NoSQL (MongoDB or Cassandra) database. Not sure which one is the right one.
memcached
Varnish or Squid for HTTP acceleration
PHP and Python (not sure if PHP is that scalable)
nginx or Apache web server
Any recommendations?
There are NoSQL databases that have an integrated web server and can handle many more web requests per second (including database transaction time) than traditional web services requesting data from an external data source. Using this kind of solution increases performance, saves a lot of implementation time, and simplifies scaling your website.
The recommendation depends on how you plan to implement the solution: server-side rendering or client-side rendering? Will you have any MVVM-style implementation making the communication chatty? Also, what server-side environment do you have in mind: Microsoft or Linux?
Take a look at the Starcounter database, which has a web server component integrated into the database engine, and see if that could help you.

How do I sync an offline web app (HTML+JS+CSS) with my server?

Do I need to implement my own sync methods in order to make an offline web app (HTML+CSS+JS) stay up to date with changes made on the server (and vice versa)? I'm using MySQL on the server side.
I read 'Two-way sync between iPhone application and web application', which has some pointers, but I think they're talking about native applications when they mention CFUUIDCreate, and I wonder whether this is possible for the web.
Does someone have some code to share, or can maybe point me in the right direction?
Thank you!
P.S.: I hope my English is not that rusty ;)
To store static content on the client side, as Jethro Larson said, the Application Cache Manifest is the way to go for caching the static parts of your website (HTML, CSS, JS, and images).
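For reference, a minimal cache manifest might look like this (the file names are placeholders); you reference it from your page with <html manifest="app.appcache">:

    CACHE MANIFEST
    # v1 - change this comment to force clients to re-download

    index.html
    css/style.css
    js/app.js

    NETWORK:
    /sync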
To handle dynamically generated content offline, you can use JavaScript templates. There are several solutions for this.
To sync the two databases, there is a project called persistence.js (persistencejs.org), a JavaScript library which offers a single API to work with WebSQL databases, Local Storage, etc. There is a plugin for this library called persistence.sync (persistencejs.org/plugin/sync) which syncs the client-side database with the server's. It consists of POST and GET requests to a specific URL that you can configure (for example yourapp.dev/sync). They have an example back-end written in Node.js, and here is one for Rails. It's simple to understand, and persistence.sync is well documented.
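persistence.sync defines its own wire format, so the following is only a hypothetical sketch of the general shape of such a sync endpoint (Python/Flask, in-memory store, last-write-wins), not the plugin's actual protocol:

    from datetime import datetime, timezone
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    # In-memory store for illustration only; a real back-end would use MySQL.
    OBJECTS = {}  # id -> {"id": ..., "data": ..., "updated_at": ...}

    @app.route("/sync", methods=["GET"])
    def pull_changes():
        # The client asks for everything changed since its last sync time.
        since = float(request.args.get("since", 0))
        changed = [o for o in OBJECTS.values() if o["updated_at"] > since]
        now = datetime.now(timezone.utc).timestamp()
        return jsonify({"now": now, "updates": changed})

    @app.route("/sync", methods=["POST"])
    def push_changes():
        # The client posts its local changes; last write wins by timestamp.
        for obj in request.get_json().get("updates", []):
            existing = OBJECTS.get(obj["id"])
            if existing is None or obj["updated_at"] > existing["updated_at"]:
                OBJECTS[obj["id"]] = obj
        return jsonify({"status": "ok"})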
Look at the offline cache:
http://www.webreference.com/authoring/languages/html/HTML5-Application-Caching/
http://www.google.com/search?q=offline+cache+html5
http://www.slideshare.net/search/slideshow?q=offline+cache