An item refers to a single item of data scraped from the target website. A common example of an item would be a product for sale on an e-commerce website. It’s important to differentiate item and item definition. In Portia, an item definition or item type refers to the schema of an item rather than the item itself. For example, book would be an item definition, and a specific book scraped from the website would be an item. An item definition consists of multiple fields, so using the example of a product you might have fields named name, price, manufacturer and so on. We use annotations to extract data from the page into each of these fields.

To ensure certain fields are extracted, you can set the Required flag on each required field. Portia will discard an item if any required fields are missing. Portia will also remove any duplicate items by default.

In some cases you may have fields where the value can vary despite being the same item, in which case you can mark them as Vary. This will ignore the field when checking for duplicates. It’s important to only use Vary when necessary, as misuse could easily lead to duplicate items being stored. The url field is a good example of where Vary is useful, as the same item may have multiple URLs. If the url field wasn’t marked as Vary, each duplicate item would be seen as unique because its URL would be different.

Field types

You can set a field’s type to ensure it will only match that kind of data. The following field types are available:

type description
text Plain text. Any markup is stripped and text within nested elements is also extracted.
number A numeric value e.g. 7, 9.59.
image An image URL. In most cases you will want to map an img element’s src attribute.
price The same as number, a numeric value.
raw html Non-sanitized HTML.
safe html Sanitized HTML. See below for more details.
geopoint The same as text.
url A URL.
date A date value parsed by dateparser. Won’t work if the annotated element has non-date text.

The safe html field type keeps the following elements: br, p, big, em, small, strong, sub, sup, ins, del, code, kbd, samp, tt, var, pre, listing, plaintext, abbr, acronym, address, bdo, blockquote, q, cite, dfn, table, tr, th, td, tbody, ul, ol, li, dl, dd, dt.

All other elements are discarded, with the exception of header tags (h1, h2h6) and b which are replaced with strong, and i which is replaced with em. Whitelisted elements contained within non-whitelisted elements will still be retained, with the exception of elements contained within a script, img or input element. For example, <div><code>example</code></div> would extract to <code>example</code>, whereas <script><code>example</code></script> would be discarded completely.