Interacting with Page
Page lifecycle and timing
- jiant.get.html() -> Promise: returns a copy of the HTML.
- jiant.get.dom() -> Promise: returns a copy of the DOM.
- jiant.get.$() -> Promise: returns a copy of the HTML wrapped by cheerio.load.
- jiant.action.pushURL() -> Promise
- jiant.action.goBack() -> Promise
The Promises above wait until the target page's document.readyState becomes interactive or complete.
When a check starts, the checking script likewise runs only after the target page's document.readyState has reached one of those states, provided the DISABLE_AUTO_OPEN_PAGE_URL config is not set.
When you call a method such as jiant.action.click that makes the target window reload, you must fetch the new content again via html|dom|$ and wait for that Promise to resolve before interacting with the new page.
// At the start of the script,
// document.readyState is already interactive or complete.
await jiant.action.click('ul.example-list li span.expand')
// It works.
let html = await jiant.get.html()
// Returns the HTML immediately.
await jiant.action.click('ul.example-list li a')
// Clicks a link; the target page starts navigating.
await jiant.action.click('ul.other-list-on-new-page li span.expand')
// It does not work, because the new page is not ready yet.
let dom = await jiant.get.dom()
// Waits for document.readyState.
await jiant.action.click('ul.other-list-on-new-page li span.expand')
// It works.
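The pattern above (fetch the content again before touching a freshly loaded page) can be wrapped in a small helper. This is only a minimal sketch: clickThenWait is a hypothetical name that is not part of jiant, and jiant is passed in as a parameter purely to keep the snippet self-contained. It relies only on jiant.action.click and jiant.get.dom behaving as described above.

```javascript
// Hypothetical helper (not part of jiant): click something that may trigger
// a navigation, then wait for the new document before returning.
async function clickThenWait(jiant, selector) {
  await jiant.action.click(selector)
  // As described above, jiant.get.dom() resolves once document.readyState is
  // interactive or complete, so awaiting it acts as a "page is ready" barrier.
  return jiant.get.dom()
}
```

In a script it might be used as `await clickThenWait(jiant, 'ul.example-list li a')`, followed by the clicks that target the new page.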
On many sites, some parts of the page start loading asynchronously after the DOM is ready.
Other parts start loading asynchronously only after the page is scrolled down.
You can use jiant.action.sleep to wait for some time before performing actions.
// At the start of the script,
// document.readyState is already interactive or complete.
await jiant.action.click('ul.async-list li span.expand')
// It does not work if that part only starts loading after the document is ready.
await jiant.action.scroll(100, 621)
// Scrolls the target page 621px down (y) and 100px right (x).
await jiant.action.sleep(1000)
// Waits 1000 milliseconds for the async data to load. The right value depends on the site.
await jiant.action.click('ul.async-list li span.expand')
// It works once that part has loaded.
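A fixed sleep is a guess at how long the async part takes. As an alternative sketch, the script could poll the page copy until the expected element shows up. The helper below is hypothetical (waitForSelector and its timeout and interval options are not part of jiant), and it assumes that each call to jiant.get.$() returns a fresh copy of the current page content. jiant is passed in only to keep the snippet self-contained.

```javascript
// Hypothetical helper (not part of jiant): poll the page copy until
// `selector` matches at least one element, or give up after `timeout` ms.
async function waitForSelector(jiant, selector, { timeout = 10000, interval = 500 } = {}) {
  const deadline = Date.now() + timeout
  while (true) {
    // Assumption: every call returns a fresh snapshot of the target page.
    const $ = await jiant.get.$()
    if ($(selector).length > 0) return true
    if (Date.now() >= deadline) return false
    await jiant.action.sleep(interval)
  }
}
```

A script could then run `if (await waitForSelector(jiant, 'ul.async-list li span.expand')) { ... }` before clicking, instead of sleeping for a fixed time.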
Parse inner pages
We typically track an index (list) page and check it for new updates, but some information, such as the description, is available only on the inner pages.
We can collect the links to the inner pages, then navigate to and parse them one by one.
<!-- HTML of target page in left window -->
<html>
<body>
<ul class="example-list">
<li>
<a href="https://example.com/story001">
<div class="title">Listen to all sides and you will be enlightened</div>
</a>
</li>
<li>
<a href="https://example.com/story002">
<div class="title">Heed only one side and you will stay in the dark</div>
</a>
</li>
</ul>
</body>
</html>
// script in side panel
const $ = await jiant.get.$()
let items = []
// get all links from index list page
let links = $('ul.example-list li')
.toArray()
.map(e => $(e).find('a').attr('href'))
for (const link of links) {
// find previous item with the same link
let prevItem = await jiant.get.prevItem({ link })
// if found, return the previous one and skip to avoid duplicate parsing.
if (prevItem) {
items.push(prevItem)
continue
}
// otherwise, navigate the target page to the link
await jiant.action.pushURL(link)
// wait for the page to load, then get its content
let n$ = await jiant.get.$()
// parse more info from the detail page
let title = n$('h2.title').first().text()
let description = n$('div.main-text').first().text()
items.push({
title, description, link
})
// throttle requests so the site is not hit too frequently
await jiant.action.sleep(5000)
// continue with the next link
}
return { items }
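The sample page uses absolute href values, but real list pages often mix in relative links, anchors without an href, or duplicates. A possible pre-processing step is sketched below. toAbsoluteLinks is a hypothetical helper, not part of jiant, and it uses only the standard URL class.

```javascript
// Hypothetical helper (not part of jiant): turn raw href values into
// unique absolute http(s) URLs, resolved against the list page URL.
function toAbsoluteLinks(hrefs, baseUrl) {
  const seen = new Set()
  for (const href of hrefs) {
    if (!href) continue // <a> without href: attr('href') is undefined
    let url
    try {
      url = new URL(href, baseUrl)
    } catch {
      continue // skip values that cannot be parsed as a URL
    }
    if (url.protocol !== 'http:' && url.protocol !== 'https:') continue
    url.hash = '' // '#fragment' variants point at the same page
    seen.add(url.href)
  }
  return [...seen]
}
```

In the script above, the links array could be passed through it first, for example `links = toAbsoluteLinks(links, await jiant.get.pageURL())`, before the loop calls jiant.get.prevItem and jiant.action.pushURL.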
Reference
jiant.get.html(noWait) -> Promise
jiant.get.dom(noWait) -> Promise
jiant.get.$(noWait) -> Promise
Returns a copy of the target page's content.
The Promise waits until the target page's document.readyState becomes interactive or complete.
- @param noWait (boolean): if true, the Promise resolves without waiting for document.readyState.
let htmlString = await jiant.get.html()
// return raw html string:
// <html><body>....</body></html>
let domInstance = await jiant.get.dom()
// domInstance is equal to:
// (new DOMParser()).parseFromString(htmlString, 'text/html')
let $ = await jiant.get.$()
// $ is equal to: cheerio.load(htmlString)
TIP
Cheerio is an HTML parser whose API is similar to jQuery's.
For more information about cheerio.load, please refer to the Cheerio documentation.
jiant.get.pageURL() -> Promise
Returns the location.href of the current page in the left window.
jiant.get.targetPageURL() -> string
Returns the target page URL of the task, as set in the task editing panel. To get the URL of the current page, use jiant.get.pageURL.
jiant.get.prevItem({ contentHash, link, title }) -> Promise
Returns the previously saved local item of the task that matches the given params, or null if none is found. It is useful for deduplication.
let link = 'https://example.com/story002'
let prevItem = await jiant.get.prevItem({ link })
if (prevItem) console.log(prevItem)
// output: {
// link: 'https://example.com/story002',
// id: '...',
// title: '...',
// ...
// }
jiant.get.configs() -> object
Returns the task configuration (JSON) as an object.
jiant.get.customParams() -> object
Returns the customParams of the task configuration (JSON) as an object.
jiant.get.rsshub(config: object) -> Promise
Returns a set of helpers compatible with those described in the RSSHub documentation.
const {
ctx, // partially compatible
load, // == cheerio.load
md5,
ofetch,
cache, // cache.tryGet
parseDate,
parseRelativeDate,
timezone,
} = await jiant.get.rsshub()
- @param options.path: a path pattern used to parse the page URL into params and queries.
To be compatible with RSSHub's ctx.req.param and ctx.req.query, path must be set in options.
Parsing the page URL with path works the same way as the jiant.parse.pathRegExp method.
// if current page URL is
// 'https://example.com/page/animal/cat?name=gaf'
const { ctx } = await jiant.get.rsshub({
  path: '/page/:category/:tp',
})
const name = ctx.req.query('name')
// name is 'gaf'
const { category, tp } = ctx.req.param()
console.log(category, tp)
// output: 'animal' 'cat'
// ctx.req.param() is equivalent to:
const pageUrl = await jiant.get.pageURL()
const params = jiant.parse.pathRegExp(
  pageUrl,
  '/page/:category/:tp'
)
// params.category is 'animal', params.tp is 'cat'
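To make the path matching less opaque, here is a rough sketch of how a pattern such as '/page/:category/:tp' can be turned into a RegExp and applied to a page URL. This illustrates the general technique only and is not jiant's implementation of jiant.parse.pathRegExp, which may support more syntax. matchPath and queryOf are hypothetical names.

```javascript
// Illustration only: NOT jiant's implementation of jiant.parse.pathRegExp.
// Supports literal segments and ':name' segments, nothing else.
function matchPath(pageUrl, pattern) {
  const names = []
  const source = pattern
    .split('/')
    .map((segment) => {
      if (segment.startsWith(':')) {
        names.push(segment.slice(1))
        return '([^/]+)' // a named segment captures one path component
      }
      // escape RegExp metacharacters in literal segments
      return segment.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
    })
    .join('/')
  const match = new RegExp('^' + source + '/?$').exec(new URL(pageUrl).pathname)
  if (!match) return null
  const params = {}
  names.forEach((name, i) => {
    params[name] = decodeURIComponent(match[i + 1])
  })
  return params
}

// Query values come straight from the standard URL API.
function queryOf(pageUrl, key) {
  return new URL(pageUrl).searchParams.get(key)
}
```

With the URL from the example above, `matchPath(url, '/page/:category/:tp')` yields an object with category and tp, and `queryOf(url, 'name')` reads the name query value.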
jiant.action.click(cssSelector: string | string[], delay: int) -> Promise
In the target page, finds the elements matching the query selectors, then clicks them one by one.
- @param cssSelector: a selector string or an array of selector strings.
- @param delay: milliseconds to wait after every click. Default is 0.
await jiant.action.click('span.expand', 100)
// In the target page, clicks all span elements with class expand, one by one,
// waiting 100ms before the next click.
await jiant.action.click(['span.expand', 'span.expand > button'], 10)
jiant.action.pushURL(url: string, noWait: false) -> Promise
jiant.action.goBack(noWait: false) -> Promise
- @param noWait: if true, do not wait for document.readyState.
Navigates the target page to a new URL, or back to the previous one.
By default, the Promise waits until the target page has loaded, as judged by document.readyState.
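As a small usage sketch, the two methods can be paired so that a script visits a detail page, extracts something, and always returns to the page it came from. visitAndReturn and its extract callback are hypothetical names, not part of jiant. jiant is passed in as a parameter only to keep the snippet self-contained.

```javascript
// Hypothetical helper (not part of jiant): open `url` in the target page,
// run `extract` on its content, then go back even if extraction throws.
async function visitAndReturn(jiant, url, extract) {
  await jiant.action.pushURL(url) // waits for the new page by default
  try {
    const $ = await jiant.get.$()
    return await extract($)
  } finally {
    await jiant.action.goBack() // restore the previous page
  }
}
```

For example, `await visitAndReturn(jiant, link, ($) => $('h2.title').first().text())` would read a title from a detail page and leave the target window back on the list page.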
jiant.action.sleep(duration: int) -> Promise
- @param duration: time to wait, in milliseconds.
// It is equivalent to:
function sleep(ms) {
return new Promise((res) => setTimeout(res, ms))
}
jiant.action.scroll(x: int, y: int) -> Promise
Scrolls the target page to (x, y), in px.
jiant.action.scrollDownSmooth(y: int, duration: int, steps: int) -> Promise
Scrolls down a total of y px over a total of duration milliseconds.
The scroll is split into steps smaller actions.
The distance and duration of each small action are slightly randomized.
- @param y: total distance to scroll down, in px. Default is 0.
- @param duration: total scrolling time, in milliseconds. Default is 0.
- @param steps: number of small actions. Default is 5.
await jiant.action.scrollDownSmooth(1000, 2000, 5)
// It is equivalent to the following actions:
let totalDistance = 1000
let totalDuration = 2000
let steps = 5
// 1. split the 1000px distance into 5 slightly randomized steps.
let distances = someFunctionSplitValueRandom(totalDistance, steps)
// -> [101, 202, 303, 222, 172]
// 2. split the 2000ms duration into 5 slightly randomized steps.
let durations = someFunctionSplitValueRandom(totalDuration, steps)
// -> [202, 404, 606, 444, 344]
// 3. run the small actions
for (let i = 0; i < steps; i++) {
  await jiant.action.scroll(0, distances[i])
  await jiant.action.sleep(durations[i])
}
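The someFunctionSplitValueRandom placeholder above is not a real function. One way such a split could be written is sketched below. splitValueRandom, its jitter parameter, and the injectable random source are all assumptions made for illustration, and jiant's actual randomization may differ.

```javascript
// Hypothetical sketch of the placeholder used above: split `total` into
// `steps` non-negative integer parts that always sum to `total`,
// each deviating from the even share by roughly +/- `jitter`.
function splitValueRandom(total, steps, jitter = 0.3, random = Math.random) {
  // Random weights in [1 - jitter, 1 + jitter]
  const weights = Array.from({ length: steps }, () => 1 + (random() * 2 - 1) * jitter)
  const weightSum = weights.reduce((a, b) => a + b, 0)
  const parts = weights.map((w) => Math.floor((total * w) / weightSum))
  // Flooring drops a few units; add them to the last part so the sum is exact.
  const used = parts.reduce((a, b) => a + b, 0)
  parts[steps - 1] += total - used
  return parts
}
```

Calling `splitValueRandom(1000, 5)` returns five integers that add up to 1000. Passing a fixed random function makes the result reproducible, which is convenient for testing.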
jiant.action.ofetch(...params) -> Promise
- @param params: refer to the ofetch documentation.
Calls ofetch in the target page.
- The request carries the default headers of the target page, such as cookies, origin...
- The response contains only the result data, with no headers, status...
TIP
You can also call the fetch API directly in the script. However, it will not carry any headers of the target page, and it is also restricted by the server's CORS rules.
jiant.action.ofetchCors(...params) -> Promise
- @param params: refer to the ofetch documentation.
- The request is sent with empty headers, without origin...
- The response contains only the result data, with no headers, status...