external/uBlock - Forgejo: Beyond coding. We forge.

external/uBlock

mirror of https://github.com/gorhill/uBlock.git synced 2024-11-11 17:41:03 +01:00

Author	SHA1	Message	Date
Raymond Hill	40de7d6489	Fix WASM memory allocation in bidi-trie Related issue: - https://github.com/uBlockOrigin/uBlock-issues/issues/764	2019-10-27 08:36:17 -04:00
Raymond Hill	c71624d1da	Fix access to detached buffer when using WASM in bidi-trie Related issue: - https://github.com/uBlockOrigin/uBlock-issues/issues/764 WebAssembly.Memory.grow() preserves the buffer content when we grow it, no need to manually copy it. Doing so was causing an access to a no longer valid ArrayBuffer.	2019-10-26 17:37:47 -04:00
Raymond Hill	b0cbc47d9a	Add WASM versions for some bidi-trie methods Related issue: - https://github.com/uBlockOrigin/uBlock-issues/issues/761 Changes related to above issue made it possible to create WASM versions of methods used in the bidi-trie. In this commit, WASM versions for startsWith(), indexOf() and lastIndexOf() have been implemented.	2019-10-26 13:13:53 -04:00
Raymond Hill	30393fdcf1	Exclude data type (i.e. `csp=`) from bidi-trie We need a `matchAll()` method on the bidi-trie before we can store filters of type `data` in it. Related issue: - https://github.com/uBlockOrigin/uBlock-issues/issues/761 Related commit: - `7971b22385`	2019-10-22 18:14:49 -04:00
Raymond Hill	7971b22385	Expand bidi-trie usage in static network filtering engine Related issues: - https://github.com/uBlockOrigin/uBlock-issues/issues/761 - https://github.com/uBlockOrigin/uBlock-issues/issues/528 The previous bidi-trie code could only hold filters which are plain pattern, i.e. no wildcard characters, and which had no origin option (`domain=`), right and/or left anchor, and no `csp=` option. Example of filters that could be moved into a bidi-trie data structure: &ad_box_ /w/d/capu.php?z=$script,third-party \|\|liveonlinetv247.com/images/muvixx-150x50-watch-now-in-hd-play-btn.gif Examples of filters that could NOT be moved to a bidi-trie: -adap.$domain=~l-adap.org /tsc.php?&ses= \|\|ibsrv.net/forumsponsor$domain=[...] @@\|\|imgspice.com/jquery.cookie.js\|$script \|\|view.atdmt.com^/iview/$third-party \|\|postimg.cc/image/$csp=[...] Ideally the filters above should be able to be moved to a bidi-trie since they are basically plain patterns, or at least partially moved to a bidi-trie when there is only a single wildcard (i.e. made of two plain patterns). Also, there were two distinct bidi-tries in which plain-pattern filters can be moved to: one for patterns without hostname anchoring and another one for patterns with hostname-anchoring. This was required because the hostname-anchored patterns have an extra condition which is outside the bidi-trie knowledge. This commit expands the number of filters which can be stored in the bidi-trie, and also remove the need to use two distinct bidi-tries. - Added ability to associate a pattern with an integer in the bidi-trie [1]. - The bidi-trie match code passes this externally provided integer when calling an externally provided method used for testing extra conditions that may be present for a plain pattern found to be matching in the bidi-trie. - Decomposed existing filters into smaller logical units: - FilterPlainLeftAnchored => FilterPatternPlain + FilterAnchorLeft - FilterPlainRightAnchored => FilterPatternPlain + FilterAnchorRight - FilterExactMatch => FilterPatternPlain + FilterAnchorLeft + FilterAnchorRight - FilterPlainHnAnchored => FilterPatternPlain + FilterAnchorHn - FilterWildcard1 => FilterPatternPlain + [ FilterPatternLeft or FilterPatternRight ] - FilterWildcard1HnAnchored => FilterPatternPlain + [ FilterPatternLeft or FilterPatternRight ] + FilterAnchorHn - FilterGenericHnAnchored => FilterPatternGeneric + FilterAnchorHn - FilterGenericHnAndRightAnchored => FilterPatternGeneric + FilterAnchorRight + FilterAnchorHn - FilterOriginMixedSet => FilterOriginMissSet + FilterOriginHitSet - Instances of FilterOrigin[...], FilterDataHolder can also be added to a composite filter to represent `domain=` and `csp=` options. - Added a new filter class, FilterComposite, for filters which are a combination of two or more logical units. A FilterComposite instance is a match when all* filters composing it are a match. Since filters are now encoded into combination of smaller units, it becomes possible to extract the FilterPatternPlain component and store it in the bidi-trie, and use the integer as a handle for the remaining extra conditions, if any. Since a single pattern in the bidi-trie may be a component for different filters, the associated integer points to a sequence of extra conditions, and a match occurs as soon as one of the extra conditions (which may itself be a sequence of conditions) is fulfilled. Decomposing filters which are currently single instance into sequences of smaller logical filters means increasing the storage and CPU overhead when evaluating such filters. The CPU overhead is compensated by the fact that more filters can now moved into the bidi-trie, where the first match is efficiently evaluated. The extra conditions have to be evaluated if and only if there is a match in the bidi-trie. The storage overhead is compensated by the bidi-trie's intrinsic nature of merging similar patterns. Furthermore, the storage overhead is reduced by no longer using JavaScript array to store collection of filters (which is what FilterComposite is): the same technique used in [2] is imported to store sequences of filters. A sequence of filters is a sequence of integer pairs where the first integer is an index to an actual filter instance stored in a global array of filters (`filterUnits`), while the second integer is an index to the next pair in the sequence -- which means all sequences of filters are encoded in one single array of integers (`filterSequences` => Uint32Array). As a result, a sequence of filters can be represented by one single integer -- an index to the first pair -- regardless of the number of filters in the sequence. This representation is further leveraged to replace the use of JavaScript array in FilterBucket [3], which used a JavaScript array to store collection of filters. Doing so means there is no more need for FilterPair [4], which purpose was to be a lightweight representation when there was only two filters in a collection. As a result of the above changes, the map of `token` (integer) => filter instance (object) used to associate tokens to filters or collections of filters is replaced with a more efficient map of `token` (integer) to filter unit index (integer) to lookup a filter object from the global `filterUnits` array. Another consequence of using one single global array to store all filter instances means we can reuse existing instances when a logical filter instance is parameter-less, which is the case for FilterAnchorLeft, FilterAnchorRight, FilterAnchorHn, the index to these single instances is reused where needed. `urlTokenizer` now stores the character codes of the scanned URL into a bidi-trie buffer, for reuse when string matching methods are called. New method: `tokenHistogram()`, used to generate histograms of occurrences of token extracted from URLs in built-in benchmark. The top results of the "miss" histogram are used as "bad tokens", i.e. tokens to avoid if possible when compiling filter lists. All plain pattern strings are now stored in the bidi-trie memory buffer, regardless of whether they will be used in the trie proper or not. Three methods have been added to the bidi-trie to test stored string against the URL which is also stored in then bidi-trie. FilterParser is now instanciated on demand and released when no longer used. *** [1] `135a45a878/src/js/strie.js (L120)` [2] `e94024d350` [3] `135a45a878/src/js/static-net-filtering.js (L1630)` [4] `135a45a878/src/js/static-net-filtering.js (L1566)`	2019-10-21 08:15:58 -04:00
Raymond Hill	9f7e385a5c	Code review fix re. max string length in bidi-trie Related commit: - `fb4e94f92c` A bidi-trie can't store strings longer than 255 characters because the string segment lengths are encoded into a single byte. This commit ensures only strings smaller than 256 characters are stored in the bidi-tries.	2019-08-23 11:30:10 -04:00
Raymond Hill	fb4e94f92c	Fix spurious 256-char limit for filters stored in bidi-trie Plain filters can be any length, while the bidi-trie was assuming max length of 256. The origin of this error is from when the bidi-trie code was originally imported from the hntrie code as start to develop bidi-trie. The erroneous code could cause issue when the following conditions were met: - Plain filter longer than 256 characters - Free space in bidi-trie's character buffer was less than 256 bytes Likely a rare occurrence, but some filter lists do contains long plain filters, for example: https://gitlab.com/curben/urlhaus-filter/raw/master/urlhaus-filter.txt Related feedback: - https://www.reddit.com/r/uBlockOrigin/comments/cs1y26/	2019-08-22 17:11:49 -04:00
Raymond Hill	cfc2ce333d	Implement bidirectional plain-string trie The bidirectional trie allows storing the right and left parts of a string into a trie given a pivot position. Releated issue: - https://github.com/uBlockOrigin/uBlock-issues/issues/528 Additionally, the mandatory token-at-index-0 rule for FilterPlainHnAnchored has been lifted, thus allowing the engine to pick a potentially better token at any position in the filter string. *** TODO: Eventually rename `strie.js` to `biditrie.js`. TODO: Fix dump() method, it currently only show the right-hand side of a filter string.	2019-06-18 19:16:39 -04:00
Raymond Hill	3692bb4ada	Add HNTrieRef.dump() and STrieRef.dump() as dev tool To be used at the console, as an investigation tool for development purpose. Using it to verify the content of the largest FilterHostnameDict instance, I spotted an all-uppercase hostname in the HNTrieRef instance: µBlock.staticNetFilteringEngine.categories.get(0).get(0x10000000).dict.dump(); Thus the changes to static-net-filtering.js are to fix the erroneous insertion of filters with uppercase characters. The single instance found was a hostname entry in Malware Domain List (TRIANGLESERVICESLTD dot COM).	2019-05-06 11:12:39 -04:00
Raymond Hill	fa83744b58	Use a sequence of base 64 numbers to encode array buffers The purpose of using a custom base128 encoder is to convert array buffers into strings, to allow a direct string-to-array buffer conversion at load time: string => array buffer Whereas a JSON array would require an extra step: JSON array as string => JS array => array buffer Turns out that the current use of a custom base128 encoding results in a significantly larger selfie storage usage when converting array buffers into strings. Speculation: possibly the browser convert the strings to save into JSON strings internally. Since the custom base128 encoder is likely to cause the resulting string to contain a lot of unprintable ASCII characters, these will need to be escaped when converted to JSON -- escaped characters occupy more space than non-escaped ones. Using a sequence of base 64 numbers means only printable will be present in the output string, hence no escaping necessary. I have observed significant reduction in storage usage for selfie purpose.	2019-04-20 09:06:54 -04:00
Raymond Hill	3f3a1543ea	Add HNTrie-based filter classes to store origin-only filters Related issue: - https://github.com/uBlockOrigin/uBlock-issues/issues/528#issuecomment-484408622 Following STrie-related work in above issue, I noticed that a large number of filters in EasyList were filters which only had to match against the document origin. For instance, among just the top 10 most populous buckets, there were four such buckets with over hundreds of entries each: - bits: 72, token: "http", 146 entries - bits: 72, token: "https", 139 entries - bits: 88, token: "http", 122 entries - bits: 88, token: "https", 118 entries These filters in these buckets have to be matched against all the network requests. In order to leverage HNTrie for these filters[1], they are now handled in a special way so as to ensure they all end up in a single HNTrie (per bucket), which means that instead of scanning hundreds of entries per URL, there is now a single scan per bucket per URL for these apply-everywhere filters. Now, any filter which fulfill ALL the following condition will be processed in a special manner internally: - Is of the form `\|https://` or `\|http://` or ``; and - Does have a `domain=` option; and - Does not have a negated domain in its `domain=` option; and - Does not have `csp=` option; and - Does not have a `redirect=` option If a filter does not fulfill ALL the conditions above, no change in behavior. A filter which matches ALL of the above will be processed in a special manner: - The `domain=` option will be decomposed so as to create as many distinct filter as there is distinct value in the `domain=` option - This also apply to the `badfilter` version of the filter, which means it now become possible to `badfilter` only one of the distinct filter without having to `badfilter` all of them. - The logger will always report these special filters with only a single hostname in the `domain=` option. ** [1] HNTrie is currently WASM-ed on Firefox.	2019-04-19 16:33:46 -04:00
Raymond Hill	c229003d31	Performance + code maintenance work on static network filtering engine Implement a plain string trie container class: STrieContainer. Make use of STrieContainer where beneficial Some filter buckets can grow quite large, and in such case coalescing "trieable" filter classes into a single trie reduces lookup performance and memory usage. For instance, at time of commit, the filter bucket for the `ad` keyword contains 919 entries[1]. Coalescing trieable filters of the same class into a single plain string trie reduced the size of the bucket into 50 entries + two tries which are scanned only once each whenever the bucket is visited. [1] Enter the following code at uBO's dev console: µBlock.staticNetFilteringEngine.categories.get(0).get(µBlock.urlTokenizer.tokenHashFromString('ad')) Refactor static network filtering engine code to make use of ES6's syntactic sugar `class`. Change first auto-update run from 7 to 5 minutes.	2019-04-14 16:45:20 -04:00