An item refers to a single item of data scraped from the target website. A common example of an item would be a product for sale on an e-commerce website. It’s important to differentiate item and item definition. In Portia, an item definition or item type refers to the schema of an item rather than the item itself. For example,
book would be an item definition, and a specific book scraped from the website would be an item. An item definition consists of multiple fields, so using the example of a product you might have fields named
manufacturer and so on. We use annotations to extract data from the page into each of these fields.
To ensure certain fields are extracted, you can set the Required flag on each required field. Portia will discard an item if any required fields are missing. Portia will also remove any duplicate items by default.
In some cases you may have fields where the value can vary despite being the same item, in which case you can mark them as Vary. This will ignore the field when checking for duplicates. It’s important to only use Vary when necessary, as misuse could easily lead to duplicate items being stored. The
url field is a good example of where Vary is useful, as the same item may have multiple URLs. If the
url field wasn’t marked as Vary, each duplicate item would be seen as unique because its URL would be different.
You can set a field’s type to ensure it will only match that kind of data. The following field types are available:
|text||Plain text. Any markup is stripped and text within nested elements is also extracted.|
|number||A numeric value e.g. 7, 9.59.|
|image||An image URL. In most cases you will want to map an
|price||The same as
|raw html||Non-sanitized HTML.|
|safe html||Sanitized HTML. See below for more details.|
|geopoint||The same as
|date||A date value parsed by dateparser. Won’t work if the annotated element has non-date text.|
safe html field type keeps the following elements:
All other elements are discarded, with the exception of header tags (
b which are replaced with
i which is replaced with
em. Whitelisted elements contained within non-whitelisted elements will still be retained, with the exception of elements contained within a
input element. For example,
<div><code>example</code></div> would extract to
<script><code>example</code></script> would be discarded completely.