The above is not a full solution or workaround for:
* all the other projects on the internet (almost all of them) that would have to audit their existing source code for malicious unicode and prevent future inclusion of malicious unicode,
* any of the other issues raised on https://trojansource.codes/ such as fixing compilers or text editors.
Thank you. Outreach on this issue is certainly helpful.
Best to include the link to the original attack research:
Already mentioned as a reference in Michael Altfield's article.
Patrick via Whonix Forum:
Gentoo:
https://bugs.gentoo.org/862372
Mint OS:
opened 06:57PM - 30 Jul 22 UTC
I couldn't find any proper place to report this, feel free to shift it if this is not the right place.
Quote [Trojan Source: Invisible Vulnerabilities](https://trojansource.codes/):
> **Invisible Source Code Vulnerabilities**
>
> Some Vulnerabilities are Invisible
> Rather than inserting logical bugs, adversaries can attack the encoding of source code files to inject vulnerabilities.
>
> These adversarial encodings produce no visual artifacts.
>
> **The trick**
>
> The trick is to use Unicode control characters to reorder tokens in source code at the encoding level.
> These visually reordered tokens can be used to display logic that, while semantically correct, diverges from the logic presented by the logical ordering of source code tokens.
> Compilers and interpreters adhere to the logical ordering of source code, not the visual order.
>
> **The attack**
>
> The attack is to use control characters embedded in comments and strings to reorder source code characters in a way that changes its logic.
> ...
> Adversaries can leverage this deception to commit vulnerabilities into code that will not be seen by human reviewers.
This attack pattern is tracked as CVE-2021-42574.
[CVE-2021-42574 at redhat](https://access.redhat.com/security/cve/cve-2021-42574)
> **The supply chain**
>
> This attack is particularly powerful within the context of software supply chains.
> If an adversary successfully commits targeted vulnerabilities into open source code by deceiving human reviewers, downstream software will likely inherit the vulnerability.
>
> **The technique**
>
> There are multiple techniques that can be used to exploit the visual reordering of source code tokens:
> * Early Returns cause a function to short circuit by executing a return statement that visually appears to be within a comment.
> * Commenting-Out causes a comment to visually appear as code, which in turn is not executed.
> * Stretched Strings cause portions of string literals to visually appear as code, which has the same effect as commenting-out and causes string comparisons to fail.
>
> **The variant**
>
> A similar attack exists which uses homoglyphs, or characters that appear near identical.
>
> ...
>
> The above example defines two distinct functions with near indistinguishable visual differences highlighted for reference.
> An attacker can define such homoglyph functions in an upstream package imported into the global namespace of the target, which they then call from the victim code.
> This attack variant is tracked as CVE-2021-42694.
>
> **The defense**
>
> * Compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters.
> * Language specifications should formally disallow unterminated bidirectional control characters in comments and string literals.
> * Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.
>
> **The paper**
>
> Complete details can be found in the related [paper](https://trojansource.codes/trojan-source.pdf).
By authors Nicholas Boucher and Ross Anderson, 2021, [arXiv](https://arxiv.org/abs/2111.00169).
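
The defense quoted above asks tools to flag unterminated bidirectional control characters. A minimal sketch of such a check in Python follows. It assumes termination is evaluated per line rather than per comment or string literal, which is a simplification, and `has_unterminated_bidi` is a hypothetical helper name, not part of any existing tool.

```python
# Bidirectional control characters that open an embedding/override or an isolate.
EMBEDDING_OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING, closes an embedding/override
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE, closes an isolate


def has_unterminated_bidi(line: str) -> bool:
    """Return True if a bidi control opened on this line is never closed.

    Assumption of this sketch: each line is checked on its own.
    """
    open_controls = []
    for ch in line:
        if ch in EMBEDDING_OPENERS:
            open_controls.append("embedding")
        elif ch in ISOLATE_OPENERS:
            open_controls.append("isolate")
        elif ch == PDF:
            # PDF only closes the innermost embedding/override.
            if open_controls and open_controls[-1] == "embedding":
                open_controls.pop()
        elif ch == PDI:
            # PDI closes the innermost isolate and anything opened inside it.
            if "isolate" in open_controls:
                while open_controls.pop() != "isolate":
                    pass
    return bool(open_controls)


def flag_lines(text: str) -> list:
    """Return 1-based numbers of the lines with unterminated bidi controls."""
    return [
        number
        for number, line in enumerate(text.split("\n"), start=1)
        if has_unterminated_bidi(line)
    ]
```

For example, `flag_lines(source_text)` could be run over a contribution before review. A flagged line is only a hint that a human should look at it in an editor that renders control characters.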
tasks:
- [ ] **check for potential existing compromises:** scan all distribution source code for existing unicode
- [ ] **educate existing and future distribution source code reviewers:** add a distribution source code reviewer policy to a GitHub repository or the distribution website, which existing and future reviewers need to acknowledge ("I understand the issue"). More of a reminder, a conversation starter.
- [ ] **remove as much unicode from distribution source code as possible**: by reducing the amount of unicode in distribution source code, audits for malicious unicode with automated tools get simpler. If possible, if unicode is considered essential, instead of writing `®` when required it should be encoded as `&reg;`.
- [ ] **local check by reviewer:** document tools that distribution source code reviewers could/should use to scan future contributions for malicious unicode
- [ ] **remote cursory check:** add a github pull request hook that notifies when unicode is included in a pull request (This is just an additional, handy layer of protection. Since infrastructure should be distrusted this alone is not a full solution.)
- [ ] **build scripts / CI scripts:** should check if there is unicode in any files except in opt-in expected files. If there is unexpected unicode, the build should error out.
- [ ] **scan upstream projects source code**: check if these are compromised by malicious unicode
- [ ] **notify upstream projects**: these might not be aware of this issue and might already be compromised by malicious unicode.
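
The scanning and build-script tasks above could be sketched roughly as follows. This is an illustrative Python sketch, not the distribution's actual tooling. The names `find_non_ascii`, `scan_tree` and `run_check` are hypothetical, and it assumes files are UTF-8 text and that expected unicode is opted in by relative path.

```python
from pathlib import Path


def find_non_ascii(text):
    """Return (line, column, code point) for every non-ASCII character."""
    hits = []
    # Split on "\n" only: str.splitlines() would also split on (and hide)
    # non-ASCII line separators such as U+2028.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits


def scan_tree(root, allowed=frozenset()):
    """Map relative path -> hits for every file under root not opted in."""
    root = Path(root)
    report = {}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or ".git" in rel.parts:
            continue
        if rel.as_posix() in allowed:
            continue  # opt-in file where unicode is expected
        # Invalid UTF-8 decodes to U+FFFD, which is non-ASCII and gets flagged.
        text = path.read_bytes().decode("utf-8", errors="replace")
        hits = find_non_ascii(text)
        if hits:
            report[rel.as_posix()] = hits
    return report


def run_check(root, allowed=frozenset()):
    """Print findings and return an exit code (1 if unexpected unicode)."""
    report = scan_tree(root, allowed)
    for rel, hits in report.items():
        for lineno, col, code_point in hits:
            print(f"{rel}:{lineno}:{col}: unexpected {code_point}")
    return 1 if report else 0
```

A build or CI script could then end with something like `sys.exit(run_check(".", allowed={"LICENSE"}))`, so that unexpected unicode makes the build error out.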
references:
* https://tech.michaelaltfield.net/2021/11/22/bidi-unicode-github-defense/
* https://www.kicksecure.com/wiki/Unicode
Patrick
October 3, 2022, 12:10pm
17
In an LKRG source code file, a comment includes a real name which contains this character: ł
Non-malicious.
This triggers the `dm-check-unicode` check.
Therefore I am excluding the files where this happens from the check.
This is clearly a non-ideal solution, but fixing this is an issue for the whole Free and Open Source community. See also Detecting Malicious Unicode in Source Code and Pull Requests
--exclude=LICENSE
--exclude=lkrg-openrc.sh
Kicksecure:master
← stirredaround:master
opened 03:04PM - 04 Apr 23 UTC
Could you review this please? @grass
grass
April 24, 2023, 3:44pm
19
First thing, I don’t know Perl very well, but I can understand it. I tried to make grep print but it wasn’t working, so Perl seems better for this, besides the fact that grep’s option `-P` stands for Perl, so we were already using it.
I used the tool to scan the files in nickboucher/trojan-source (Trojan Source: Invisible Vulnerabilities) on GitHub, especially the Bash dir. The GitHub web interface does not show all of the unicode; you have to use a local editor or paste into a functional online viewer such as Bidi Viewer, which is made by the same person.
Another point is the pattern:
SEARCH_PATTERN='[^[:ascii:]]|[\x{061C}\x{200E}\x{200F}\x{202A}\x{202B}\x{202C}\x{202D}\x{202E}\x{2066}\x{2067}\x{2068}\x{2069}]'
I don’t see the need for the second part, everything after the pipe `|`, because the negated ASCII class already matches every character in the second part.
From this sample, using only `[^[:ascii:]]` detected all the problems. I also diffed the results for the whole directory using the full pattern versus only the non-ASCII pattern, and they were the same.
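
The redundancy can also be checked directly. Below is a small Python sketch, assuming the input is decoded as Unicode text. Python's `re` module has no `[:ascii:]` class, so `[^\x00-\x7F]` stands in for `[^[:ascii:]]` here, and the names are illustrative.

```python
import re

# Code points from the second half of SEARCH_PATTERN (bidi control characters).
BIDI_CONTROLS = [
    0x061C, 0x200E, 0x200F,
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
    0x2066, 0x2067, 0x2068, 0x2069,
]

# Stand-in for [^[:ascii:]]: anything outside U+0000..U+007F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

# The full pattern: non-ASCII OR one of the listed bidi controls.
FULL = re.compile(
    r"[^\x00-\x7F]|[" + "".join(chr(cp) for cp in BIDI_CONTROLS) + "]"
)


def same_verdict(sample: str) -> bool:
    """True if both patterns agree on whether the sample contains a hit."""
    return bool(NON_ASCII.search(sample)) == bool(FULL.search(sample))


# Every listed control is above U+007F, so the first alternative already
# matches it and the second alternative never adds a match.
all_non_ascii = all(cp > 0x7F for cp in BIDI_CONTROLS)
all_caught = all(NON_ASCII.fullmatch(chr(cp)) for cp in BIDI_CONTROLS)
```

Both `all_non_ascii` and `all_caught` evaluate to `True`, which is consistent with the diff result described above. Keeping the explicit list would only matter if the check were later relaxed to allow some non-ASCII characters.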
One thing I don’t like is printing `No spurious characters found`, because it gets in the way of the really important part: whether spurious characters were found. What do you think?
buster
October 20, 2023, 8:12pm
23
GUYS thank you, this is fire