Deduplication
Built-in deduplication
Because a monitoring task runs at intervals once you start continuous checking, it will inevitably fetch duplicate items.
Before fetched items are saved, they are checked for duplicates by the value of contentHash.
If not set manually, contentHash is generated automatically from title, link, and description.
If a previously saved item from the same monitoring task has the same contentHash, the new item is skipped when saving.
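The check described above can be sketched roughly as follows. This is only an illustration, not the tool's actual implementation: the helper names contentHashOf and filterNewItems are hypothetical, and the use of SHA-256 over a JSON-encoded array is an assumption, since the real hashing scheme is not documented here.

```javascript
const crypto = require('crypto')

// Hypothetical helper: the real hashing scheme is an assumption, not documented behavior.
function contentHashOf(item) {
  // A manually set contentHash takes precedence over the generated one.
  if (item.contentHash) return item.contentHash
  // Otherwise derive the hash from title, link, and description.
  const parts = [item.title, item.link, item.description].map(v => (v == null ? '' : v))
  return crypto.createHash('sha256').update(JSON.stringify(parts)).digest('hex')
}

// Hypothetical helper: keep only items whose hash was not saved before by the same task.
// It also drops repeats within the same batch, which is an extra assumption of this sketch.
function filterNewItems(items, savedHashes) {
  const seen = new Set(savedHashes)
  const fresh = []
  for (const item of items) {
    const hash = contentHashOf(item)
    if (seen.has(hash)) continue // duplicate: skipped when saving
    seen.add(hash)
    fresh.push({ ...item, contentHash: hash })
  }
  return fresh
}
```

Under these assumptions, an item whose title, link, and description are all unchanged between two runs produces the same hash and is dropped on the second run, while an item with a manually set contentHash is compared by that value alone.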
Manually skip
Even with built-in deduplication, parsing the same pages repeatedly wastes time and resources, and may even get you banned by the website's request frequency limits.
We strongly recommend skipping duplicate items manually in your script with jiant.get.prevItem: if a duplicate item is found, reuse the previous one and skip parsing.
WARNING
Do not forget to comment out this check when debugging; otherwise the script will always load the previously saved items.
// script in side panel
const $ = await jiant.get.$()
let items = []
// get all links from the index list page
let links = $('ul.example-list li')
  .toArray()
  .map(e => $(e).find('a').attr('href'))
for (const link of links) {
  // find a previously saved item with the same link
  let prevItem = await jiant.get.prevItem({ link })
  // if found, reuse the previous item and skip duplicate parsing
  if (prevItem) {
    items.push(prevItem)
    continue
  }
  // if not found, navigate to the detail page
  await jiant.action.pushURL(link)
  // wait for the page to load, then get its content
  let n$ = await jiant.get.$()
  // parse more info from the detail page
  let title = n$('h2.title').first().text()
  let description = n$('div.main-text').first().text()
  items.push({
    title, description, link
  })
  // try the next link
}
return { items }
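One way to follow the warning above without editing the script for every debugging session is to guard the lookup behind a flag. The sketch below is a suggestion only: collectItems, lookup, parse, and debug are hypothetical names. Here lookup stands in for jiant.get.prevItem and parse stands in for the detail-page parsing, so the logic can be tried outside the side panel.

```javascript
// Hypothetical helper, not part of the jiant API.
// lookup: async function standing in for jiant.get.prevItem
// parse:  async function standing in for the detail-page parsing
// debug:  when true, the previous-item lookup is bypassed so every link is parsed again
async function collectItems(links, lookup, parse, debug = false) {
  const items = []
  for (const link of links) {
    // Skip the lookup entirely while debugging.
    const prevItem = debug ? null : await lookup({ link })
    if (prevItem) {
      // Reuse the previously saved item and avoid parsing the page again.
      items.push(prevItem)
      continue
    }
    items.push(await parse(link))
  }
  return items
}
```

With debug left at false, only links without a previously saved item reach parse; setting it to true forces every link to be parsed, which is the behavior you usually want while adjusting selectors.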
Disable built-in deduplication
Sometimes you may want to save duplicate items. In that case, set DUPLICATE_CHECK_IN_MS in the configuration (JSON) of the specific task.