Commit graph

221 commits

Author SHA1 Message Date
Raymond Hill
Fine-tune new bidi-trie code
Related issue:
2019-10-29 10:26:34 -04:00
Raymond Hill
Add WASM implementation for BidiTrieContainer.matches()
Related issue:
2019-10-28 13:57:35 -04:00
Raymond Hill
Fix WASM memory allocation in bidi-trie
Related issue:
2019-10-27 08:36:17 -04:00
Raymond Hill
Add WASM versions for some bidi-trie methods
Related issue:

Changes related to above issue made it possible to
create WASM versions of methods used in the bidi-trie.
In this commit, WASM versions for startsWith(), indexOf()
and lastIndexOf() have been implemented.
2019-10-26 13:13:53 -04:00
Raymond Hill
Exclude data type (i.e. csp=) from bidi-trie
We need a `matchAll()` method on the bidi-trie before
we can store filters of type `data` in it.

Related issue:

Related commit:
- 7971b22385
2019-10-22 18:14:49 -04:00
Raymond Hill
Expand bidi-trie usage in static network filtering engine
Related issues:

The previous bidi-trie code could only hold filters which
are plain patterns, i.e. no wildcard characters, and which
had no origin option (`domain=`), no right and/or left anchor,
and no `csp=` option.

Example of filters that could be moved into a bidi-trie
data structure:


Examples of filters that could NOT be moved to a bidi-trie:


Ideally the filters above should be able to be moved to a
bidi-trie since they are basically plain patterns, or at
least partially moved to a bidi-trie when there is only a
single wildcard (i.e. made of two plain patterns).

Also, there were two distinct bidi-tries into which
plain-pattern filters could be moved: one for patterns
without hostname anchoring and another one for patterns
with hostname anchoring. This was required because the
hostname-anchored patterns have an extra condition which
is outside the bidi-trie's knowledge.

This commit expands the number of filters which can be
stored in the bidi-trie, and also removes the need to
use two distinct bidi-tries.

- Added ability to associate a pattern with an integer
  in the bidi-trie [1].
    - The bidi-trie match code passes this externally
      provided integer when calling an externally
      provided method used for testing extra conditions
      that may be present for a plain pattern found to
      be matching in the bidi-trie.

- Decomposed existing filters into smaller logical units:
    - FilterPlainLeftAnchored =>
        FilterPatternPlain +
    - FilterPlainRightAnchored =>
        FilterPatternPlain +
    - FilterExactMatch =>
        FilterPatternPlain +
        FilterAnchorLeft +
    - FilterPlainHnAnchored =>
        FilterPatternPlain +
    - FilterWildcard1 =>
        FilterPatternPlain + [
          FilterPatternLeft or
    - FilterWildcard1HnAnchored =>
        FilterPatternPlain + [
          FilterPatternLeft or
        ] +
    - FilterGenericHnAnchored =>
        FilterPatternGeneric +
    - FilterGenericHnAndRightAnchored =>
        FilterPatternGeneric +
        FilterAnchorRight +
    - FilterOriginMixedSet =>
        FilterOriginMissSet +
    - Instances of FilterOrigin[...], FilterDataHolder
      can also be added to a composite filter to
      represent `domain=` and `csp=` options.

- Added a new filter class, FilterComposite, for
  filters which are a combination of two or more
  logical units. A FilterComposite instance is a
  match when *all* filters composing it are a match.

Since filters are now encoded into combination of
smaller units, it becomes possible to extract the
FilterPatternPlain component and store it in the
bidi-trie, and use the integer as a handle for the
remaining extra conditions, if any.

Since a single pattern in the bidi-trie may be a
component for different filters, the associated
integer points to a sequence of extra conditions,
and a match occurs as soon as one of the extra
conditions (which may itself be a sequence of
conditions) is fulfilled.
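
The two matching rules described above (a composite matches when
all of its units match, a handle matches as soon as one of its
extra conditions matches) can be sketched as follows. This is a
simplified illustration, not the engine's code: plain JavaScript
arrays stand in for the integer-encoded sequences, and
`matchExtraConditions` and the `ctx` argument are hypothetical
names.

```javascript
// Simplified sketch: arrays are used here only for readability.
class FilterComposite {
    constructor(units) {
        this.units = units;
    }
    // A composite is a match only when ALL of its units match.
    match(ctx) {
        return this.units.every(unit => unit.match(ctx));
    }
}

// Hypothetical helper: `alternatives` are the extra conditions
// associated with one plain pattern found in the bidi-trie.
// A match occurs as soon as ONE alternative is fulfilled.
function matchExtraConditions(alternatives, ctx) {
    return alternatives.some(cond => cond.match(ctx));
}
```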

Decomposing filters which are currently single
instance into sequences of smaller logical filters
means increasing the storage and CPU overhead when
evaluating such filters. The CPU overhead is
compensated by the fact that more filters can now be
moved into the bidi-trie, where the first match is
efficiently evaluated. The extra conditions have to
be evaluated if and only if there is a match in the
bidi-trie.

The storage overhead is compensated by the
bidi-trie's intrinsic nature of merging similar

Furthermore, the storage overhead is reduced by no
longer using a JavaScript array to store collections
of filters (which is what FilterComposite is):
the same technique used in [2] is imported to store
sequences of filters.

A sequence of filters is a sequence of integer pairs
where the first integer is an index to an actual
filter instance stored in a global array of filters
(`filterUnits`), while the second integer is an index
to the next pair in the sequence -- which means all
sequences of filters are encoded in one single array
of integers (`filterSequences` => Uint32Array). As
a result, a sequence of filters can be represented by
one single integer -- an index to the first pair --
regardless of the number of filters in the sequence.
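
A minimal sketch of the pair encoding described above. Only
`filterUnits` and `filterSequences` are names taken from this
commit message; the helper names, the initial buffer size, and
the choice of index 0 as the end-of-sequence marker are
assumptions made for illustration.

```javascript
const filterUnits = [];                     // global array of filter instances
let filterSequences = new Uint32Array(16);  // pairs: [unit index, next pair index]
let filterSequenceWritePtr = 2;             // assumption: index 0 means "end of sequence"

// Hypothetical helper: store a filter instance, return its index.
function addFilterUnit(unit) {
    filterUnits.push(unit);
    return filterUnits.length - 1;
}

// Hypothetical helper: create a new pair pointing to `nextIndex`,
// growing the buffer when needed. Returns the index of the new pair.
function prependToSequence(unitIndex, nextIndex = 0) {
    if (filterSequenceWritePtr + 2 > filterSequences.length) {
        const grown = new Uint32Array(filterSequences.length * 2);
        grown.set(filterSequences);
        filterSequences = grown;
    }
    const i = filterSequenceWritePtr;
    filterSequences[i + 0] = unitIndex;
    filterSequences[i + 1] = nextIndex;
    filterSequenceWritePtr += 2;
    return i;
}

// Walk a sequence given only the index of its first pair.
function* unitsInSequence(seqIndex) {
    for (let i = seqIndex; i !== 0; i = filterSequences[i + 1]) {
        yield filterUnits[filterSequences[i]];
    }
}
```

With this layout a whole sequence is referenced by one integer:
prepending a unit returns the new head index, and iteration
follows the second integer of each pair until it reaches 0.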

This representation is further leveraged to replace
the use of JavaScript array in FilterBucket [3],
which used a JavaScript array to store collections
of filters. Doing so means there is no more need for
FilterPair [4], whose purpose was to be a lightweight
representation when there were only two filters in a

As a result of the above changes, the map of `token`
(integer) => filter instance (object) used to
associate tokens to filters or collections of filters
is replaced with a more efficient map of `token`
(integer) to filter unit index (integer), used to look up
a filter object from the global `filterUnits` array.

Another consequence of using one single global
array to store all filter instances is that we can reuse
existing instances when a logical filter instance is
parameter-less, which is the case for FilterAnchorLeft,
FilterAnchorRight and FilterAnchorHn: the index to these
single instances is reused where needed.

`urlTokenizer` now stores the character codes of the
scanned URL into a bidi-trie buffer, for reuse when
string matching methods are called.

New method: `tokenHistogram()`, used to generate
histograms of occurrences of tokens extracted from URLs
in the built-in benchmark. The top results of the "miss"
histogram are used as "bad tokens", i.e. tokens to
avoid if possible when compiling filter lists.

All plain pattern strings are now stored in the
bidi-trie memory buffer, regardless of whether they
will be used in the trie proper or not.

Three methods have been added to the bidi-trie to test
a stored string against the URL, which is also stored in
the bidi-trie.

FilterParser is now instantiated on demand and
released when no longer used.


[1] 135a45a878/src/js/strie.js (L120)
[2] e94024d350
[3] 135a45a878/src/js/static-net-filtering.js (L1630)
[4] 135a45a878/src/js/static-net-filtering.js (L1566)
2019-10-21 08:15:58 -04:00
Raymond Hill
Fix minor bugs spotted during code review
2019-10-14 09:03:51 -04:00
Raymond Hill
Rename register-like variables
Use leading `$` instead of trailing `$` to denote
register-like variables. This conveniently allows
grouping them together in the debugger.
2019-09-29 13:21:09 -04:00
Raymond Hill
Store csp= filters into main data structure
This commit makes it so that `csp=` filters
are now stored in the same data structures as
all other static network filters rather than
being stored in a separate one.

This internal change is motivated by the wish
to bring session filters to the static network
filtering engine, as has already been done for
the static extended filtering engine in the
following commit:

2019-09-28 11:30:26 -04:00
Raymond Hill
Add support for ping static filter option
Related issue:


Test page:

Additionally, network requests of type `beacon` will
be mapped to `ping` by the static filtering engine.
2019-09-22 09:11:55 -04:00
Raymond Hill
Add support for elemhide (through specifichide)
Related documentation:

Related feedback/discussion:

The `elemhide` filter option as per ABP semantics is
now supported. Previously uBO would consider `elemhide`
to be an alias of `generichide`.

The support of `elemhide` is through the convenient
conversion of `elemhide` option into existing
`generichide` option and new `specifichide` option.

The purpose of the new `specifichide` filter option
is to disable all specific cosmetic filters, i.e.
those which target a specific site.

Additionally, for convenience, the filter
options `generichide`, `specifichide` and `elemhide`
can be aliased using the shorter forms `ghide`,
`shide` and `ehide` respectively.
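
The conversion and aliasing described above can be sketched as
below. This is an illustrative sketch only: the function name
`expandHideOptions` and the array-of-strings input are
assumptions, not the filter parser's actual interface.

```javascript
// Short forms and the options they stand for (from the text above).
const hideOptionAliases = new Map([
    [ 'ghide', 'generichide' ],
    [ 'shide', 'specifichide' ],
    [ 'ehide', 'elemhide' ],
]);

// Hypothetical helper: resolve aliases, then convert `elemhide`
// into `generichide` + `specifichide`. Other options pass through.
function expandHideOptions(options) {
    const out = new Set();
    for (const raw of options) {
        const name = hideOptionAliases.get(raw) || raw;
        if (name === 'elemhide') {
            out.add('generichide');
            out.add('specifichide');
        } else {
            out.add(name);
        }
    }
    return Array.from(out);
}
```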
2019-09-21 11:30:38 -04:00
Raymond Hill
Code review fix re. max string length in bidi-trie
Related commit:
- fb4e94f92c

A bidi-trie can't store strings longer than 255 characters
because the string segment lengths are encoded into a single
byte. This commit ensures only strings smaller than
256 characters are stored in the bidi-tries.
2019-08-23 11:30:10 -04:00
Raymond Hill
Add support for AdGuard's mp4 filter option
Related discussion:

The `mp4` filter option will be converted to `redirect=noopmp4-1s`
internally, and `media` type will be assumed.
2019-08-13 12:30:11 -04:00
Raymond Hill
Add support for AdGuard's empty option
Related issue:

The filter option `empty` is converted to `redirect=empty`
by uBO internally; however unlike when the `redirect=`
option is used expressly, the `empty` option does not
require a resource type.

When `empty` is used, only network requests which are meant
to return a text response will be redirected to an empty
response body by uBO -- so `empty` will not work for
resources such as images, media, or other binary resources.
2019-08-13 08:16:21 -04:00
Raymond Hill
Add new static network filter option: redirect-rule=
Related issue:

The purpose of this new option is to add the ability
to create a standalone redirect rule without being forced
to create a block filter (a corresponding block filter
is always created when using `redirect=`).


The syntax `*$redirect=token,...` is now supported: there
is no need to "trick" the filter parser with
`*/*$redirect=token,...` in order to create redirect rules
which are meant to match all paths.

Filters of the form `|http*://` will be normalized into
two corresponding filters `|https://` and `|http://` so as
to reduce the number of filters in the buckets of
untokenizable filters.
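
A minimal sketch of the `|http*://` normalization described
above. The function name is hypothetical, and treating anything
after the `|http*://` prefix (such as filter options) as a
suffix to carry over unchanged is an assumption.

```javascript
// Hypothetical helper: returns the list of filters to compile
// in place of the given raw filter.
function normalizeHttpWildcard(filter) {
    const prefix = '|http*://';
    if (filter.startsWith(prefix) === false) {
        return [ filter ];
    }
    const rest = filter.slice(prefix.length);
    // One wildcard filter becomes two plain, scheme-specific filters.
    return [ '|https://' + rest, '|http://' + rest ];
}
```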
2019-08-03 10:18:47 -04:00
Raymond Hill
Fix some element picker-related issues
Related discussion:

Make the element picker better reflect network filters as
parsed by the static network filtering engine. Additionally,
discard single alphanumeric character-based filters.

Related discussion:

Inject newly created cosmetic filters into the DOM
filterer, in order for these filters to be enforced by
the DOM filterer in subsequent dynamic DOM changes.
2019-06-29 11:06:03 -04:00
Raymond Hill
Code review of HNTrie/staticNetFilteringEngine
- Remove HNTrieContainer class from global context by
  storing it as a property of µBlock.

- Use block scope to isolate HNTrie-related constants
  from global context.

- Prevent filters which are pure IP address from
  being stored in an HNTrie instance -- as this
  could cause false positives.
2019-06-19 10:00:19 -04:00
Raymond Hill
Implement bidirectional plain-string trie
The bidirectional trie allows storing the right
and left parts of a string into a trie given a
pivot position.
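
To illustrate the idea, here is a toy bidirectional trie built
from nested Maps. It is a sketch under stated assumptions, not
the layout used in `strie.js`: the class name, the `R`/`L` key
prefixes and the `$` end marker are all invented for this
example. The right part of a pattern (from the pivot onward) is
stored forward, then the left part is stored backward.

```javascript
// Get or create the child node for a given key.
const childOf = (node, key) => {
    let next = node.get(key);
    if (next === undefined) {
        next = new Map();
        node.set(key, next);
    }
    return next;
};

class TinyBidiTrie {
    constructor() {
        this.root = new Map();
    }
    // Store `pattern`, split at `pivot`.
    add(pattern, pivot) {
        let node = this.root;
        for (let i = pivot; i < pattern.length; i++) {
            node = childOf(node, 'R' + pattern[i]);
        }
        for (let i = pivot - 1; i >= 0; i--) {
            node = childOf(node, 'L' + pattern[i]);
        }
        node.set('$', true);
    }
    // True if a stored pattern matches `text` with its pivot at `pos`.
    matches(text, pos) {
        return this.walkRight(this.root, text, pos, pos);
    }
    walkRight(node, text, pos, r) {
        // A right part may end here: try the left part from pos - 1.
        if (this.walkLeft(node, text, pos - 1)) { return true; }
        if (r >= text.length) { return false; }
        const next = node.get('R' + text[r]);
        return next !== undefined && this.walkRight(next, text, pos, r + 1);
    }
    walkLeft(node, text, l) {
        if (node.has('$')) { return true; }
        if (l < 0) { return false; }
        const next = node.get('L' + text[l]);
        return next !== undefined && this.walkLeft(next, text, l - 1);
    }
}
```

For example, after `trie.add('/ads/banner', 1)`, calling
`trie.matches(url, url.indexOf('ads'))` first walks `ads/banner`
to the right of the pivot, then `/` to its left.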

Related issue:

Additionally, the mandatory token-at-index-0 rule
for FilterPlainHnAnchored has been lifted, thus
allowing the engine to pick a potentially better token
at any position in the filter string.


TODO: Eventually rename `strie.js` to `biditrie.js`.

TODO: Fix dump() method, it currently only shows the
      right-hand side of a filter string.
2019-06-18 19:16:39 -04:00
Raymond Hill
Discard whole filter with bad csp= content
Related discussion:

uBO was just removing the bad option, while the whole
filter needs to be discarded.
2019-05-24 15:41:37 -04:00
Raymond Hill
Fix regression affecting *$csp=-like filters
Related discussion:

Regression introduced in:
- 3f3a1543ea
2019-05-24 12:15:32 -04:00
Raymond Hill
Minor code review of 4430ec11e2
2019-05-23 08:15:26 -04:00
Raymond Hill
Start using async/await where it makes sense
2019-05-22 19:23:04 -04:00
Raymond Hill
Rearrange inner loop of static network filtering engine
The motivations for the re-arrangement:

- Reducing the number of entry points:
  matchStringExactString() has been removed and
  matchString() is simply reused with a modifier parameter
  to enable matching variants.

- Presumption that most matches, if any, occur early with
  the left-most tokens in a URL. This gives a very small
  marginal performance gain as per built-in benchmark.
2019-05-22 17:51:03 -04:00
Raymond Hill
Re-arrange parsing of type options to be order-independent
Related commit:
- 1888033070

This removes the need to place `all` before any negated
type in the list of options.
2019-05-21 14:04:21 -04:00
Raymond Hill
Add support for all filter option
Related discussion:

The `all` option is equivalent to specifying all
network-based types + `popup`, `document`,
`inline-font`, `inline-script`.

Example from discussion:


Above will block all network requests, block all popups,
prevent inline fonts/scripts from ``. EasyList-
compatible syntax does not allow to accomplish that
semantic when using only `||^`.

If using specific negated type options along with `all`,
the order in which the options appear is important. In
such case `all` should always be first, followed by
the negated type option(s).
2019-05-20 13:46:36 -04:00
Raymond Hill
Avoid duplicated strings in filterOrigin w/ new approach
The new approach is simpler and should benefit selfie

This renders stringDeduplicater obsolete -- it has been
2019-05-17 10:13:58 -04:00
Raymond Hill
Fix incorrect use of this in static method
Related issue:

Regression from:
- 19ece97b0c
2019-05-11 17:40:55 -04:00
Raymond Hill
Add HNTrieRef.dump() and STrieRef.dump() as dev tool
To be used at the console, as an investigation tool for
development purpose.

Using it to verify the content of the largest
FilterHostnameDict instance, I spotted an all-uppercase
hostname in the HNTrieRef instance:


Thus the changes to static-net-filtering.js are to fix
the erroneous insertion of filters with uppercase
characters. The single instance found was a hostname entry
in Malware Domain List (TRIANGLESERVICESLTD dot COM).
2019-05-06 11:12:39 -04:00
Raymond Hill
Remove unnecessary null placeholders FilterOriginHitSet et al.
The `null` placeholders are not necessary, we can just use
default arguments instead, and add the HNTrieContainer
references if and only if they are instantiated.
2019-05-01 18:54:11 -04:00
Raymond Hill
Increase resolution of known-token lookup table
Related commit:
- 69a43e07c4

Using 32 bits of token hash rather than just the 16 lower
bits does help discard more unknown tokens.

Using the default filter lists, the known-token lookup
table is populated by 12,276 entries, out of 65,536, thus
making the case that theoretically there are a lot of
possible tokens which can be discarded.

In practice, running the built-in
staticNetFilteringEngine.benchmark() with default filter
lists, I find that 1,518,929 tokens were skipped out of
4,441,891 extracted tokens, or 34%.
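
A sketch of how all 32 bits of a token hash can contribute to an
index into a 65,536-entry table. The folding expression and all
names below are assumptions made for illustration; the commit
message only states that 32 bits are used rather than the 16
lower bits.

```javascript
// One byte per slot: 1 if some known token maps to that slot.
const knownTokenTable = new Uint8Array(65536);

// Assumed folding: XOR the upper 16 bits into the lower 16 bits.
const foldTokenHash = th => (th & 0xFFFF) ^ (th >>> 16);

function addKnownToken(th) {
    knownTokenTable[foldTokenHash(th)] = 1;
}

// False means the token is definitely unused and can be skipped;
// true means it may be used (collisions are possible).
function isPossiblyKnownToken(th) {
    return knownTokenTable[foldTokenHash(th)] === 1;
}
```

With only the 16 lower bits, hashes `0x12345678` and
`0x00005678` would share a slot; with the folding above they
land in slots `0x444C` and `0x5678` respectively.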
2019-04-27 08:18:01 -04:00
Raymond Hill
Fix list lookup of multi-hostname domain= filters in logger
Related commit:
- 3f3a1543ea

The regression was preventing uBO from finding which list a filter
originated from. This affected only filters for which the `domain=`
option had multiple hostnames.
2019-04-27 07:04:43 -04:00
Raymond Hill
Ignore unknown tokens in urlTokenizer.getTokens()
Given that all tokens extracted from one single URL are potentially
iterated multiple times in a single URL-matching cycle, it pays to
ignore extracted tokens which are known to not be used anywhere in
the static filtering engine.

The gain in processing a single network request in the static
filtering engine can become especially high when dealing with
long and random-looking URLs, which URLs have a high likelihood
of containing a majority of tokens which are known to not be in
2019-04-26 17:14:00 -04:00
Raymond Hill
Leverage compile-time token information in new filter classes
Related commit:
- 99390390fc

The token information available at compile time can be stored
in the filter to be used at match() time. This allows the use of
startsWith() rather than a more costly indexOf() call as a first
quick test to detect mismatches.
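
A minimal sketch of the idea, assuming the filter remembers at
compile time where its token sits inside its pattern. The class
name and the `tokenBeg`/`tokenPos` parameters are hypothetical.

```javascript
class FilterPlainWithTokenOffset {
    // `tokenBeg`: offset of the token within `pattern`, known at
    // compile time.
    constructor(pattern, tokenBeg) {
        this.pattern = pattern;
        this.tokenBeg = tokenBeg;
    }
    // `tokenPos`: position in `url` where the token was found.
    // The pattern can only match at one place, so a single
    // startsWith() at that place replaces a scan with indexOf().
    match(url, tokenPos) {
        const beg = tokenPos - this.tokenBeg;
        return beg >= 0 && url.startsWith(this.pattern, beg);
    }
}
```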
2019-04-26 11:16:47 -04:00
Raymond Hill
Introduce three more specialized filter classes to avoid regexes
Performance- and memory-related work. Three more classes have
been created to avoid regex-based filters internally.

Purpose is to enforce filters which have only one single
wildcard in their pattern, a common occurrence. The filter
pattern is split in two literal string segments.

Similar to the above, with the added condition that the filter is
hostname-anchored (`||`). The "Wildcard2" variant is a further
specialization to enforce filters where the only wildcard
is immediately preceded by the `^` special character, again
a very common occurrence.

Using two literal string segments in lieu of regexes makes it
possible to quickly detect a mismatch by just testing the first segment.
Additionally, this reduces memory footprint as regexes are
much more expensive memory-wise than plain strings.
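
The two-segment approach can be sketched as follows. This is an
illustration only: the class name is hypothetical, the pattern
is assumed to contain exactly one `*`, and anchoring and token
positions are left out.

```javascript
class FilterSingleWildcard {
    constructor(pattern) {
        // Split the pattern into the two literal segments
        // surrounding its single wildcard.
        const pos = pattern.indexOf('*');
        this.s0 = pattern.slice(0, pos);
        this.s1 = pattern.slice(pos + 1);
    }
    match(url) {
        // Testing the first segment alone rules out most URLs.
        const beg = url.indexOf(this.s0);
        if (beg === -1) { return false; }
        // The second segment must appear after the first one.
        return url.indexOf(this.s1, beg + this.s0.length) !== -1;
    }
}
```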

These three new filter classes allow to replace the use of
5276 regex-based filters internally with plain string-based

Often-called isHnAnchored() has been further fine-tuned to
avoid as much work as possible. I have also observed that
using an arrow function for closure purposes measurably helps
performance, as per built-in benchmark.
2019-04-25 17:48:08 -04:00
Raymond Hill
Fix overzealous strict blocking (regression)
Related issue:

Regression from:
- 3f3a1543ea (diff-522a16ddeed280252d7c3a351261b441R2767)
2019-04-21 09:17:31 -04:00
Raymond Hill
Fix how *$, |https://, http:// filters are reported in logger
This was a regression introduced in

Reported in issue:
2019-04-20 17:25:32 -04:00
Raymond Hill
Use a sequence of base 64 numbers to encode array buffers
The purpose of using a custom base128 encoder is to
convert array buffers into strings, to allow a direct
string-to-array buffer conversion at load time:

  string => array buffer

Whereas a JSON array would require an extra step:

  JSON array as string => JS array => array buffer

Turns out that the current use of a custom base128 encoding
results in a significantly larger selfie storage usage when
converting array buffers into strings.

Speculation: possibly the browser converts the strings to
save into JSON strings internally. Since the custom base128
encoder is likely to cause the resulting string to contain
a lot of unprintable ASCII characters, these will need to
be escaped when converted to JSON -- escaped characters
occupy more space than non-escaped ones.

Using a sequence of base 64 numbers means only printable
characters will be present in the output string, hence no
escaping is necessary. I have observed a significant reduction
in storage usage for selfie purpose.
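
A sketch of encoding an array buffer as a sequence of base 64
numbers. It assumes a fixed width of six digits per 32-bit value
and the digit alphabet shown; the actual encoder may differ
(for instance by using variable-length numbers), so treat the
names and format here as hypothetical.

```javascript
// 64 printable digits; none of them needs escaping in JSON.
const B64_DIGITS =
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Each uint32 becomes 6 digits (6 x 6 = 36 bits, enough for 32).
function encodeUint32Array(arr) {
    let out = '';
    for (let v of arr) {
        let s = '';
        for (let i = 0; i < 6; i++) {
            s = B64_DIGITS[v % 64] + s;
            v = Math.floor(v / 64);
        }
        out += s;
    }
    return out;
}

// Direct string => array buffer conversion, no intermediate JS array.
function decodeUint32Array(str) {
    const arr = new Uint32Array(str.length / 6);
    for (let i = 0; i < arr.length; i++) {
        let v = 0;
        for (let j = 0; j < 6; j++) {
            v = v * 64 + B64_DIGITS.indexOf(str[i * 6 + j]);
        }
        arr[i] = v;
    }
    return arr;
}
```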
2019-04-20 09:06:54 -04:00
Raymond Hill
Add HNTrie-based filter classes to store origin-only filters
Related issue:

Following STrie-related work in the above issue, I noticed that a large
number of filters in EasyList were filters which only had to match
against the document origin. For instance, among just the top 10
most populous buckets, there were four such buckets with over
a hundred entries each:

- bits: 72, token: "http", 146 entries
- bits: 72, token: "https", 139 entries
- bits: 88, token: "http", 122 entries
- bits: 88, token: "https", 118 entries

The filters in these buckets have to be matched against all
the network requests.

In order to leverage HNTrie for these filters[1], they are now handled
in a special way so as to ensure they all end up in a single HNTrie
(per bucket), which means that instead of scanning hundreds of entries
per URL, there is now a single scan per bucket per URL for these
apply-everywhere filters.

Now, any filter which fulfills ALL of the following conditions will be
processed in a special manner internally:

- Is of the form `|https://` or `|http://` or `*`; and
- Does have a `domain=` option; and
- Does not have a negated domain in its `domain=` option; and
- Does not have `csp=` option; and
- Does not have a `redirect=` option

If a filter does not fulfill ALL the conditions above, there is no
change in behavior.

A filter which matches ALL of the above will be processed in a special

- The `domain=` option will be decomposed so as to create as many
  distinct filters as there are distinct values in the `domain=` option
- This also applies to the `badfilter` version of the filter, which
  means it now becomes possible to `badfilter` only one of the
  distinct filters without having to `badfilter` all of them.
- The logger will always report these special filters with only a
  single hostname in the `domain=` option.
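
The conditions and the decomposition above can be sketched as
follows. The function name is hypothetical and the option
parsing is deliberately naive (options are split on commas),
so this is an illustration of the rule, not the parser's code.

```javascript
// Hypothetical helper: returns one filter per hostname when the
// filter qualifies, otherwise returns the filter unchanged.
function decomposeOriginOnlyFilter(filter) {
    const pos = filter.indexOf('$');
    if (pos === -1) { return [ filter ]; }
    const pattern = filter.slice(0, pos);
    const options = filter.slice(pos + 1).split(',');
    // Pattern must be `|https://` or `|http://` or `*`.
    if ([ '|https://', '|http://', '*' ].includes(pattern) === false) {
        return [ filter ];
    }
    // Must have a `domain=` option, and no `csp=`/`redirect=` option.
    const domainOpt = options.find(o => o.startsWith('domain='));
    if (domainOpt === undefined) { return [ filter ]; }
    if (options.some(o => o.startsWith('csp=') || o.startsWith('redirect='))) {
        return [ filter ];
    }
    // Must not have a negated domain.
    const hostnames = domainOpt.slice('domain='.length).split('|');
    if (hostnames.some(hn => hn.startsWith('~'))) { return [ filter ]; }
    // One distinct filter per hostname, other options kept in place.
    return hostnames.map(hn =>
        pattern + '$' +
        options.map(o => o === domainOpt ? 'domain=' + hn : o).join(',')
    );
}
```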


[1] HNTrie is currently WASM-ed on Firefox.
2019-04-19 16:33:46 -04:00
Raymond Hill
Cleanup comments following changes in 34f3cfe5e7
2019-04-16 19:20:56 -04:00
Raymond Hill
Add filterClassHistogram() method to µBlock.staticNetFilteringEngine
As a development tool for investigation purpose. To use, enter the
following at uBO's dev console:

2019-04-16 19:01:14 -04:00
Raymond Hill
Categorize google as a bad token for map key purpose
In the static network filtering engine, the `google` token is too
generic and probably leads to too many false positives, besides
causing an overly large filter bucket.
2019-04-16 06:52:13 -04:00
Raymond Hill
Add µBlock.staticNetFilteringEngine.bucketHistogram() as investigative dev tool
Additionally, lower the threshold of trieability to 4 for FilterPlainPrefix1.
2019-04-15 11:45:33 -04:00
Raymond Hill
Performance + code maintenance work on static network filtering engine
Implement a plain string trie container class: STrieContainer.

Make use of STrieContainer where beneficial

  Some filter buckets can grow quite large, and in such cases
  coalescing "trieable" filter classes into a single trie improves
  lookup performance and reduces memory usage.

  For instance, at time of commit, the filter bucket for the
  `ad` keyword contains 919 entries[1].

  Coalescing trieable filters of the same class into a single plain
  string trie reduced the size of the bucket to 50 entries + two
  tries which are scanned only once each whenever the bucket is

  [1] Enter the following code at uBO's dev console:

Refactor static network filtering engine code to make use of
ES6's syntactic sugar `class`.

Change first auto-update run from 7 to 5 minutes.
2019-04-14 16:45:20 -04:00
Raymond Hill
Improve usefulness of FilterContainer.benchmark()
Add ability to test/record results. This allows to compare against
output after code changes to detect and more accurately report
2019-04-14 09:44:24 -04:00
Raymond Hill
Report block count in benchmark()
The block count can be used for testing against regression after
code changes.
2019-04-12 10:19:38 -04:00
Noelle Leigh
0bb7b76338 Fixed wrong method for number of elements in a Map (#3755)
2019-04-06 16:42:24 -03:00
Raymond Hill
Add support to benchmark the dynamic filtering pane
From uBO's dev console, type:
- `µBlock.sessionFirewall.benchmark();`

Keep in mind that it's the temporary ruleset being benchmarked.
2019-02-19 10:46:33 -05:00
Raymond Hill
Properly set resource URL in benchmark loop
2019-02-17 07:45:05 -05:00
Raymond Hill
Remove obsolete code to translate |blob: filters into CSP filters
These filters are to be considered obsolete since they can't be
matched against network requests in the webRequest API.

They were probably meant to work when ABP was pre-webext, which
means they are quite probably obsolete and there is no longer
a point for uBO to conveniently translate them into CSP directives.
2019-02-16 19:25:15 -05:00
Raymond Hill
Spin-off FilterOrigin flavors into standalone classes
This removes the derivation of FilterOrigin flavors from
FilterOrigin itself and simplifies code paths. FilterOrigin
flavors are small specialized classes, no need to
overcomplicate with derivation.

Specifically, this removes an indirect call to reach the
match() method.
2019-02-16 12:16:30 -05:00