Spiders are web crawlers that use samples to extract data from the pages they visit.
You can access your spider’s properties by clicking the gear icon to the right of your spider in the list on the left.
Configuring login details
If you need to log into a site, you can configure login details by ticking ‘Perform login’ in the spider properties menu. Here you can set the login URL, username and password.
Running a spider
Portia will save your projects in slyd/data/projects. You can use portiacrawl to run a spider:
portiacrawl PROJECT_PATH SPIDER_NAME
PROJECT_PATH is the path of the project and SPIDER_NAME is a spider that exists within that project. You can list the spiders for a project with the following:
Portia spiders are ultimately Scrapy spiders. You can pass Scrapy arguments when running with portiacrawl using the -a option. You can also specify a custom settings module using the --settings option. The Scrapy documentation contains full details on available options and settings.
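As a rough illustration of how these options combine, the Python sketch below assembles a portiacrawl command line. The helper name build_portiacrawl_command, the project path, the spider name, the spider argument and the settings module name are all hypothetical. It also assumes that -a takes NAME=VALUE pairs, as Scrapy's own crawl command does, and that --settings takes the dotted path of a Python module.

```python
import shlex


def build_portiacrawl_command(project_path, spider_name,
                              spider_args=None, settings_module=None):
    """Return the argument list for a portiacrawl call (hypothetical helper)."""
    command = ["portiacrawl", project_path, spider_name]
    # Assumption: each spider argument is passed as its own -a NAME=VALUE pair.
    for name, value in (spider_args or {}).items():
        command += ["-a", f"{name}={value}"]
    # Assumption: --settings takes the dotted path of a Python settings module.
    if settings_module:
        command += ["--settings", settings_module]
    return command


command = build_portiacrawl_command(
    "slyd/data/projects/my_project",    # hypothetical project path
    "example_spider",                   # hypothetical spider name
    spider_args={"category": "books"},  # hypothetical spider argument
    settings_module="my_settings",      # hypothetical settings module
)
# Print the command as a single shell-quoted string for inspection.
print(shlex.join(command))
```

The returned list can be handed to subprocess.run as-is, which avoids shell quoting problems when argument values contain spaces.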
Minimum items threshold
To avoid infinite crawling loops, Portia spiders check whether the number of scraped items meets a minimum threshold over a given period of time. If not, the job is closed with
By default, the period is 3600 seconds and the threshold is 200 scraped items. This means that if fewer than 200 items were scraped in the last 3600 seconds, the job will close.
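The rule described above can be sketched in Python as follows. This only illustrates the logic and is not Portia's actual implementation: the class and method names are hypothetical, and the default threshold is the value stated above. The sketch assumes something else calls period_elapsed once per check period (3600 seconds by default).

```python
class MinimumItemsCheck:
    """Hypothetical helper that models the minimum items threshold rule."""

    def __init__(self, threshold=200):
        self.threshold = threshold
        self.items_in_period = 0

    def item_scraped(self):
        # Count one more item for the current period.
        self.items_in_period += 1

    def period_elapsed(self):
        """Return True if the job should close, then start a new period."""
        should_close = self.items_in_period < self.threshold
        self.items_in_period = 0
        return should_close


check = MinimumItemsCheck()
for _ in range(250):
    check.item_scraped()
first_period = check.period_elapsed()   # 250 items scraped: threshold met

for _ in range(150):
    check.item_scraped()
second_period = check.period_elapsed()  # 150 items scraped: below threshold

print(first_period, second_period)
```

Resetting the counter at the end of each period means only items scraped within the most recent period count toward the threshold.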
You can set the period in seconds with the SLYCLOSE_SPIDER_CHECK_PERIOD setting, and the threshold number of items with the