Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/759
This commit adds code to rely less on the state of the
cache storage to decide whether filter lists should be
re-compiled or whether the selfie is currently valid
at launch time when a change in compiled/selfie format
is detected.
Related issues:
- https://github.com/uBlockOrigin/uBlock-issues/issues/761
- https://github.com/uBlockOrigin/uBlock-issues/issues/528
The previous bidi-trie code could only hold filters which
are plain pattern, i.e. no wildcard characters, and which
had no origin option (`domain=`), right and/or left anchor,
and no `csp=` option.
Example of filters that could be moved into a bidi-trie
data structure:
&ad_box_
/w/d/capu.php?z=$script,third-party
||liveonlinetv247.com/images/muvixx-150x50-watch-now-in-hd-play-btn.gif
Examples of filters that could NOT be moved to a bidi-trie:
-adap.$domain=~l-adap.org
/tsc.php?*&ses=
||ibsrv.net/*forumsponsor$domain=[...]
@@||imgspice.com/jquery.cookie.js|$script
||view.atdmt.com^*/iview/$third-party
||postimg.cc/image/$csp=[...]
Ideally the filters above should be able to be moved to a
bidi-trie since they are basically plain patterns, or at
least partially moved to a bidi-trie when there is only a
single wildcard (i.e. made of two plain patterns).
Also, there were two distinct bidi-tries in which
plain-pattern filters can be moved to: one for patterns
without hostname anchoring and another one for patterns
with hostname-anchoring. This was required because the
hostname-anchored patterns have an extra condition which
is outside the bidi-trie knowledge.
This commit expands the number of filters which can be
stored in the bidi-trie, and also remove the need to
use two distinct bidi-tries.
- Added ability to associate a pattern with an integer
in the bidi-trie [1].
- The bidi-trie match code passes this externally
provided integer when calling an externally
provided method used for testing extra conditions
that may be present for a plain pattern found to
be matching in the bidi-trie.
- Decomposed existing filters into smaller logical units:
- FilterPlainLeftAnchored =>
FilterPatternPlain +
FilterAnchorLeft
- FilterPlainRightAnchored =>
FilterPatternPlain +
FilterAnchorRight
- FilterExactMatch =>
FilterPatternPlain +
FilterAnchorLeft +
FilterAnchorRight
- FilterPlainHnAnchored =>
FilterPatternPlain +
FilterAnchorHn
- FilterWildcard1 =>
FilterPatternPlain + [
FilterPatternLeft or
FilterPatternRight
]
- FilterWildcard1HnAnchored =>
FilterPatternPlain + [
FilterPatternLeft or
FilterPatternRight
] +
FilterAnchorHn
- FilterGenericHnAnchored =>
FilterPatternGeneric +
FilterAnchorHn
- FilterGenericHnAndRightAnchored =>
FilterPatternGeneric +
FilterAnchorRight +
FilterAnchorHn
- FilterOriginMixedSet =>
FilterOriginMissSet +
FilterOriginHitSet
- Instances of FilterOrigin[...], FilterDataHolder
can also be added to a composite filter to
represent `domain=` and `csp=` options.
- Added a new filter class, FilterComposite, for
filters which are a combination of two or more
logical units. A FilterComposite instance is a
match when *all* filters composing it are a
match.
Since filters are now encoded into combination of
smaller units, it becomes possible to extract the
FilterPatternPlain component and store it in the
bidi-trie, and use the integer as a handle for the
remaining extra conditions, if any.
Since a single pattern in the bidi-trie may be a
component for different filters, the associated
integer points to a sequence of extra conditions,
and a match occurs as soon as one of the extra
conditions (which may itself be a sequence of
conditions) is fulfilled.
Decomposing filters which are currently single
instance into sequences of smaller logical filters
means increasing the storage and CPU overhead when
evaluating such filters. The CPU overhead is
compensated by the fact that more filters can now
moved into the bidi-trie, where the first match is
efficiently evaluated. The extra conditions have to
be evaluated if and only if there is a match in the
bidi-trie.
The storage overhead is compensated by the
bidi-trie's intrinsic nature of merging similar
patterns.
Furthermore, the storage overhead is reduced by no
longer using JavaScript array to store collection
of filters (which is what FilterComposite is):
the same technique used in [2] is imported to store
sequences of filters.
A sequence of filters is a sequence of integer pairs
where the first integer is an index to an actual
filter instance stored in a global array of filters
(`filterUnits`), while the second integer is an index
to the next pair in the sequence -- which means all
sequences of filters are encoded in one single array
of integers (`filterSequences` => Uint32Array). As
a result, a sequence of filters can be represented by
one single integer -- an index to the first pair --
regardless of the number of filters in the sequence.
This representation is further leveraged to replace
the use of JavaScript array in FilterBucket [3],
which used a JavaScript array to store collection
of filters. Doing so means there is no more need for
FilterPair [4], which purpose was to be a lightweight
representation when there was only two filters in a
collection.
As a result of the above changes, the map of `token`
(integer) => filter instance (object) used to
associate tokens to filters or collections of filters
is replaced with a more efficient map of `token`
(integer) to filter unit index (integer) to lookup a
filter object from the global `filterUnits` array.
Another consequence of using one single global
array to store all filter instances means we can reuse
existing instances when a logical filter instance is
parameter-less, which is the case for FilterAnchorLeft,
FilterAnchorRight, FilterAnchorHn, the index to these
single instances is reused where needed.
`urlTokenizer` now stores the character codes of the
scanned URL into a bidi-trie buffer, for reuse when
string matching methods are called.
New method: `tokenHistogram()`, used to generate
histograms of occurrences of token extracted from URLs
in built-in benchmark. The top results of the "miss"
histogram are used as "bad tokens", i.e. tokens to
avoid if possible when compiling filter lists.
All plain pattern strings are now stored in the
bidi-trie memory buffer, regardless of whether they
will be used in the trie proper or not.
Three methods have been added to the bidi-trie to test
stored string against the URL which is also stored in
then bidi-trie.
FilterParser is now instanciated on demand and
released when no longer used.
***
[1] 135a45a878/src/js/strie.js (L120)
[2] e94024d350
[3] 135a45a878/src/js/static-net-filtering.js (L1630)
[4] 135a45a878/src/js/static-net-filtering.js (L1566)
Reported by:
- https://github.com/uBlock-user:
Imported custom list were incorrectly seen as out of
date immediately after import operation.
Regression from:
- e27328f931
A few lines of code were improperly removed during
refactoring.
Also, coallesce calls to selfieManager.destroy() so as
to avoid undue repeated calls to storage deletion of
selfie assets.
Related commit:
- e27328f931
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/710
Messages from unprivileged ports (i.e. from content scripts)
are no longer relayed to message handlers which are to be
strictly used to execute privileged code.
The last remaining case of unprivileged messages which
should be converted into a privileged ones will be taken
care of when the following issue is fixed:
- https://github.com/gorhill/uBlock/issues/3497
Related feedback:
- https://www.reddit.com/r/uBlockOrigin/comments/cmh910/
Additionally, the `3p` rule has been made distinct from
`3p-script`/`3p-frame` for the purpose of
"Relax blocking mode" command.
The badge color will hint at the current blocking mode.
There are four colors for the four following blocking
modes:
- JavaScript wholly disabled
- All 3rd parties blocked
- 3rd-party scripts and frames blocked
- None of the above
The default badge color will be used when JavaScript is not
wholly disabled and when there are no rules for `3p`,
`3p-script` or `3p-frame`.
A new advanced setting has been added to let the user choose
the badge colors for the various blocking modes,
`blockingProfileColors`. The value *must* be a sequence of
4 valid CSS color values that match 6 hexadecimal digits
prefixed with`#` -- anything else will be ignored.
Since resources are now immutable, by default they are
only compiled once each time uBO updates to a new
version. However I need a way to force a re-compiling
of the resource in the dev build. This commit adds code
to invalidate the resources selfie when forcing the
update of any filter list.
With hindsight, I revised decisions made earlier during
this development cycle:
Un-redirectable scriptlets have been removed from
/web_accessible_resources and instead put in the new
/assets/resources/scriptlets.js, which contains all
scriptlets used for web page injection purpose.
uBO will no longer fetch a remote version of built-in
resources.
Advanced setting `userResourcesLocation` will still be
honoured by uBO, and if set, will be fetched every
time at least one asset is updated.
This is a first step, the ultimate goal is to remove
the need for resources.txt, or at least to reduce to
only hotfixes or for trivial resources targeting very
specific websites.
Most resources will become immutable, i.e. they will
be part of uBO's code base. Advantages include easier
code maintenance (jshint, syntax highlight), and to
make scriptlets more easy to code review by external
parties (for example extension store reviewers).
TODO:
- More scriptlets need to be imported before next
release.
- Need to make legacy versions of uBO use a legacy
version of resources.txt, as all the now obsolete
scriptlets will have to be removed once uBO's
next release become widespread.
- Possibly need to add code to load binary
resources so that they can be injected as
data: URI. So far it's unclear whether this is
really needed. For example, this would be needed
if a xmlhttprequest is redirected to an image
resource.
This works only for platforms supporting the return of
Promise by network listeners, i.e. only Firefox at this
point.
When filter lists are reloaded[1], there is a small
time window in which some network requests which should
have normally been blocked are not being blocked
because the static network filtering engine may not
have yet loaded all the filters in memory
This is now addressed by suspending the network request
handler when filter lists are reloaded -- again, this
works only on supported platforms.
[1] Examples: when a filter list update session
completes; when user filters change, when
adding/removing filter lists.
The purpose of using a custom base128 encoder is to
convert array buffers into strings, to allow a direct
string-to-array buffer conversion at load time:
string => array buffer
Whereas a JSON array would require an extra step:
JSON array as string => JS array => array buffer
Turns out that the current use of a custom base128 encoding
results in a significantly larger selfie storage usage when
converting array buffers into strings.
Speculation: possibly the browser convert the strings to
save into JSON strings internally. Since the custom base128
encoder is likely to cause the resulting string to contain
a lot of unprintable ASCII characters, these will need to
be escaped when converted to JSON -- escaped characters
occupy more space than non-escaped ones.
Using a sequence of base 64 numbers means only printable
will be present in the output string, hence no escaping
necessary. I have observed significant reduction in
storage usage for selfie purpose.
The value of `suspendTabsUntilReady` was disregarded in Firefox and
uBO defaulted to always defer tab loading until it was ready.
This commit allows to disable the deferring of tab loading in
Firefox. The new valid values for `suspendTabsUntilReady` are:
- `unset`: leave it to the platform to pick the optimal
behavior (default)
- `no`: do no suspend tab loading at launch time
- `yes`: suspend tab loading at launch time
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/409
By default `indexedDB` is used in Firefox for purpose of cache storage
backend.
This commit allows to force the use of `browser.storage.local` instead
as cache storage backend. For this to happen, set `cacheStorageAPI` to
`browser.storage.local` in advanced settings.
Additionally, should `indexedDB` not be available for whatever reason,
uBO will automatically fallback to `browser.storage.local`.
The motivation is to address the higher peak memory usage at launch
time with 3rd-gen HNTrie when a selfie was present.
The selfie generation prior to this change was to collect all
filtering data into a single data structure, and then to serialize
that whole structure at once into storage (using JSON.stringify).
However, HNTrie serialization requires that a large UintArray32 be
converted into a plain JS array, which itslef would be indirectly
converted into a JSON string. This was the main reason why peak
memory usage would be higher at launch from selfie, since the JSON
string would need to be wholly unserialized into JS objects, which
themselves would need to be converted into more specialized data
structures (like that Uint32Array one).
The solution to lower peak memory usage at launch is to refactor
selfie generation to allow a more piecemeal approach: each filtering
component is given the ability to serialize itself rather than to be
forced to be embedded in the master selfie. With this approach, the
HNTrie buffer can now serialize to its own storage by converting the
buffer data directly into a string which can be directly sent to
storage. This avoiding expensive intermediate steps such as
converting into a JS array and then to a JSON string.
As part of the refactoring, there was also opportunistic code
upgrade to ES6 and Promise (eventually all of uBO's code will be
proper ES6).
Additionally, the polyfill to bring getBytesInUse() to Firefox has
been revisited to replace the rather expensive previous
implementation with an implementation with virtually no overhead.
Related issue:
- https://github.com/uBlockOrigin/uBlock-issues/issues/399
The advanced setting `cacheStorageAPI` has been added to allow
a user to force the use of IndexedDB as cache storage. Set to
`IndexedDB` to force use of IndexedDB. Default to `unset`.
Related issues:
- https://github.com/uBlockOrigin/uBlock-issues/issues/372
- https://github.com/gorhill/uBlock/issues/93
A new advanced settings has been added: `autoCommentFilterTemplate`.
Default value is `{{date}} {{origin}}`.
Placeholders are identified by `{{...}}`. There are currently
only three placeholders supported:
- `{{date}}`: will be replaced with current date
- `{{time}}`: will be replaced with current time
- `{{origin}}`: will be replaced with site information on which
the filter(s) was created
If no placeholder is found in `autoCommentFilterTemplate`, this
will disable auto-commenting. So one can use `-` to disable
auto-commenting.
Additionally, if auto-commenting is enabled, uBO will not emit a
comment if an emitted comment would be a duplicate of the last
one found in the user filter list.
commit 7c6cacc59b27660fabacb55d668ef099b222a9e6
Author: Raymond Hill <rhill@raymondhill.net>
Date: Sat Nov 3 08:52:51 2018 -0300
code review: finalize support for wasm-based hntrie
commit 8596ed80e3bdac2c36e3c860b51e7189f6bc8487
Merge: cbe1f2e 000eb82
Author: Raymond Hill <rhill@raymondhill.net>
Date: Sat Nov 3 08:41:40 2018 -0300
Merge branch 'master' of github.com:gorhill/uBlock into trie-wasm
commit cbe1f2e2f38484d42af3204ec7f1b5decd30f99e
Merge: 270fc7f dbb7e80
Author: Raymond Hill <rhill@raymondhill.net>
Date: Fri Nov 2 17:43:20 2018 -0300
Merge branch 'master' of github.com:gorhill/uBlock into trie-wasm
commit 270fc7f9b3b73d79e6355522c1a42ce782fe7e5c
Merge: d2a89cf d693d4f
Author: Raymond Hill <rhill@raymondhill.net>
Date: Fri Nov 2 16:21:08 2018 -0300
Merge branch 'master' of github.com:gorhill/uBlock into trie-wasm
commit d2a89cf28f0816ffd4617c2c7b4ccfcdcc30e1b4
Merge: d7afc78 649f82f
Author: Raymond Hill <rhill@raymondhill.net>
Date: Fri Nov 2 14:54:58 2018 -0300
Merge branch 'master' of github.com:gorhill/uBlock into trie-wasm
commit d7afc78b5f5675d7d34c5a1d0ec3099a77caef49
Author: Raymond Hill <rhill@raymondhill.net>
Date: Fri Nov 2 13:56:11 2018 -0300
finalize wasm-based hntrie implementation
commit e7b9e043cf36ad055791713e34eb0322dec84627
Author: Raymond Hill <rhill@raymondhill.net>
Date: Fri Nov 2 08:14:02 2018 -0300
add first-pass implementation of wasm version of hntrie
commit 1015cb34624f3ef73ace58b58fe4e03dfc59897f
Author: Raymond Hill <rhill@raymondhill.net>
Date: Wed Oct 31 17:16:47 2018 -0300
back up draft work toward experimenting with wasm hntries
Squashed commit of the following:
commit 6a8473822537636ac54d5dabdb14472114bb730b
Author: Raymond Hill <rhill@raymondhill.net>
Date: Mon Aug 6 10:56:44 2018 -0400
remove remnant of snappyjs and spurious instruction
commit 9a4b709bee97d3cc2235fab602359fa5953bdb46
Author: Raymond Hill <rhill@raymondhill.net>
Date: Mon Aug 6 09:48:58 2018 -0400
make cache storage compression optionally available on all platforms
New advanced setting: `cacheStorageCompression`. Default is `false`.
commit 22ee6547f2f7c9c5aefe25dea1262a1b31612155
Author: Raymond Hill <rhill@raymondhill.net>
Date: Sun Aug 5 19:16:26 2018 -0400
remove Chromium from lz4 experiment
commit ee3e201c45afe983508f70713a2d43af74737d8d
Author: Raymond Hill <rhill@raymondhill.net>
Date: Sun Aug 5 18:52:43 2018 -0400
import lz4-block-codec.wasm library
commit 883a3118efcfd749c82356fde7134754d6ae371d
Author: Raymond Hill <rhill@raymondhill.net>
Date: Sun Aug 5 18:50:46 2018 -0400
implement storage compression through lz4-wasm [draft]
commit 48d1ccaba407de447c2cd6747dc3a90839c260a7
Merge: 8ae77e6 b34c897
Author: Raymond Hill <rhill@raymondhill.net>
Date: Sat Aug 4 08:56:51 2018 -0400
Merge branch 'master' of github.com:gorhill/uBlock into lz4
commit 8ae77e6aeeaa85af335e664c2560d2afd37288c6
Author: Raymond Hill <rhill@raymondhill.net>
Date: Wed Jul 25 18:17:45 2018 -0400
experiment with compression