Projects
A project in Portia consists of one or more spiders and can be deployed to any scrapyd instance.
Versioning
Portia provides project versioning via Git, but this isn’t enabled by default.
Git versioning can be enabled by creating a local_settings.py file in the slyd/slyd directory and adding the following:
import os

SPEC_FACTORY = {
    'PROJECT_SPEC': 'slyd.gitstorage.projectspec.ProjectSpec',
    'PROJECT_MANAGER': 'slyd.gitstorage.projects.ProjectsManager',
    'PARAMS': {
        'storage_backend': 'dulwich.repo.Repo',
        'location': os.environ.get('PORTIA_DATA_DIR', SPEC_DATA_DIR)
    },
    'CAPABILITIES': {
        'version_control': True,
        'create_projects': True,
        'delete_projects': True,
        'rename_projects': True
    }
}
You can also use MySQL to store your project files in combination with Git:
import os

SPEC_FACTORY = {
    'PROJECT_SPEC': 'slyd.gitstorage.projectspec.ProjectSpec',
    'PROJECT_MANAGER': 'slyd.gitstorage.projects.ProjectsManager',
    'PARAMS': {
        'storage_backend': 'slyd.gitstorage.repo.MysqlRepo',
        'location': os.environ.get('DB_URL'),
    },
    'CAPABILITIES': {
        'version_control': True,
        'create_projects': True,
        'delete_projects': True,
        'rename_projects': True
    }
}
This will store versioned projects as blobs within the MySQL database that you specify by setting the environment variable below:
DB_URL = mysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DB>
Once this environment variable is set, you can initialize the database by running the bin/init_mysqldb script.
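A malformed DB_URL is easy to miss until the backend fails to connect, so it can help to check the value first. The following is a minimal sketch using only the Python standard library; parse_db_url is a hypothetical helper, not part of Portia, and the example credentials are placeholders.

```python
from urllib.parse import urlsplit


def parse_db_url(db_url):
    """Split a mysql:// URL into its parts (hypothetical helper, not a Portia API)."""
    parts = urlsplit(db_url)
    if parts.scheme != "mysql":
        raise ValueError("DB_URL must start with mysql://")
    database = parts.path.lstrip("/")
    if not (parts.username and parts.hostname and database):
        raise ValueError("DB_URL needs a username, host and database name")
    return {
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL omits the port
        "database": database,
    }


# Placeholder values that follow the <USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DB> layout.
info = parse_db_url("mysql://portia:secret@db.example.com:3306/portia_projects")
print(info["host"], info["port"], info["database"])
```

In practice you would pass os.environ['DB_URL'] instead of the literal string.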
Note
The MySQL backend only stores project data. Data generated during a crawl is still stored locally.
Deployment
You can deploy your Portia projects using scrapyd. Change directory into slyd/data/projects/PROJECT_NAME and add your target to scrapy.cfg. You’ll then be able to run scrapyd-deploy, which will deploy your project using the default deploy target. Alternatively, you can specify a target and project using the following:
scrapyd-deploy your_scrapyd_target -p project_name
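For reference, scrapyd-deploy reads named targets from [deploy:<name>] sections in scrapy.cfg, each with a url and an optional project key. The sketch below adds such a section with Python's configparser. The add_deploy_target helper, the existing [settings] content, and the host name are illustrative assumptions, not values taken from a real Portia project.

```python
import configparser
import io


def add_deploy_target(cfg_text, target, url, project):
    """Return cfg_text with a [deploy:<target>] section added or updated."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = "deploy:" + target
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, "url", url)
    parser.set(section, "project", project)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Assumed starting content; your generated scrapy.cfg may differ.
existing = "[settings]\ndefault = spiders.settings\n"
print(add_deploy_target(existing, "your_scrapyd_target",
                        "http://your_scrapyd_host:6800/", "project_name"))
```

Note that configparser drops comments when it rewrites a file, so editing scrapy.cfg by hand is just as reasonable for a one-off change.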
Once your spider is deployed, you can schedule it via scrapyd's schedule.json endpoint:
curl http://your_scrapyd_host:6800/schedule.json -d project=your_project_name -d spider=your_spider_name
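If you would rather schedule runs from a script, the curl command above can be reproduced with the Python standard library. This is a minimal sketch: build_schedule_request and schedule_spider are hypothetical helper names, and the host, project and spider values are the same placeholders used in the curl example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(host, project, spider, port=6800):
    """Build the form-encoded POST request that the curl command sends."""
    url = "http://{}:{}/schedule.json".format(host, port)
    body = urlencode({"project": project, "spider": spider}).encode("ascii")
    return Request(url, data=body, method="POST")


def schedule_spider(host, project, spider, port=6800):
    """Send the request and return scrapyd's decoded JSON reply."""
    request = build_schedule_request(host, project, spider, port)
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no network access, so it is safe to inspect.
request = build_schedule_request("your_scrapyd_host", "your_project_name",
                                 "your_spider_name")
print(request.full_url)
print(request.data.decode("ascii"))
```

Calling schedule_spider with the same arguments sends the request; it requires a reachable scrapyd instance, which should reply with a JSON object that includes a status field.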
Warning
Running scrapyd from your project directory will cause deployment to fail.