Some thoughts on anti-stylometry research

Greetings

I am a researcher interested in developing practical tools for improving operational security (as opposed to proposals that get lost in academic papers or theory and never see the light of day).

One area I am interested in is stylometry and its countermeasures. I have developed a prototype tool that automatically iterates paragraphs of text through an online translation interface, chained through multiple languages, in an attempt to destroy any linguistic clues, since machine translation is rather 'lossy' with respect to style and sentiment.
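For reference, the core of the prototype is roughly the following Python sketch. The translate() function is a hypothetical placeholder for whatever online translation interface is actually used, and the language chain is arbitrary:

```python
# Rough sketch of the prototype's round-trip idea. translate() is a
# placeholder for whatever online translation interface is used; the
# chain of intermediate languages is arbitrary.
def translate(text: str, source: str, target: str) -> str:
    """Placeholder: send `text` to the translation interface and return the result."""
    raise NotImplementedError("wire this up to a translation backend of your choice")

def round_trip(paragraph: str, chain=("en", "de", "ja", "en")) -> str:
    """Push a paragraph through a chain of languages and back to the original language."""
    text = paragraph
    for source, target in zip(chain, chain[1:]):
        text = translate(text, source, target)
    return text
```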

That is not best practice, because it involves disclosing plaintext to an untrusted party, and that is the drive for a client-side solution.

I have seen the A4NT project (A4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation | USENIX) and some of its code (GitHub - rakshithShetty/A4NT-author-masking: Repository for author masking), but one issue with it is that there is no documentation on actually using it. There is an old saying about academic research projects (paraphrasing): 'You have the code and a paper, go figure it out', and that appears to be the situation here.

Maybe someone with enough time can correlate the steps described in the paper with the code and get it working that way.

In any case, I would be interested in hearing about your experience.

Regards

From what I have seen in the research, this technique is not adequate, and it also has the downside of producing unintelligible text. A better way is to have a user compare their new text to samples of older text they shared before and have the tool clear it as being different.
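A rough illustration of that comparison idea, as a hypothetical Python sketch (the character-trigram features and the 0.5 threshold are arbitrary choices for illustration, not taken from any existing tool):

```python
# Hypothetical sketch: profile older posts with character trigram counts
# and only "clear" a new draft if it is sufficiently dissimilar from them.
from collections import Counter
from math import sqrt

def trigrams(text: str) -> Counter:
    """Character trigram counts, a cheap and common stylometric feature."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def clears(new_text: str, old_samples: list[str], threshold: float = 0.5) -> bool:
    """Clear the draft only if it is not too similar to any earlier sample."""
    new_profile = trigrams(new_text)
    return all(cosine(new_profile, trigrams(old)) < threshold for old in old_samples)

# Example: compare a rewritten draft against two earlier posts.
old_posts = ["I am a researcher interested in practical tools.",
             "In any case, I would be interested in hearing about your experience."]
print(clears("Anyway, keen to hear how it went for you.", old_posts))
```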

If you have time to help us integrate an existing tool that does what I mentioned, let us know.

Yes, I also agree that attempted mutation via pre-existing machine-translation methods is not a good solution.

> A better way is to have a user compare their new text to samples of older text they shared before and have the tool clear it as being different.

The A4NT whitepaper, with its attribution classifier, appears to demonstrate the semantics of such differentiation and 'clearance'.

> If you have time to help us integrate an existing tool that does what I mentioned, let us know.

The A4NT whitepaper seems to be the most complete solution in terms of analyzing the trade-offs. The code is there, but with minimal documentation. Unfortunately, I also have minimal experience with PyTorch, so either I will have to learn it or, hopefully, others with experience can contribute as well.
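For anyone else looking at it: the core idea in A4NT is an author-attribution classifier used adversarially. The sketch below is not the A4NT architecture or its code, just a hypothetical, minimal PyTorch illustration of an attribution classifier used as a 'clearance' check on a toy corpus:

```python
# Hypothetical minimal author-attribution classifier in PyTorch.
# Not the A4NT model; just the general idea of a classifier that
# scores whether new text still "looks like" a known author.
import torch
import torch.nn as nn

def char_trigram_features(text, vocab):
    """Map text to a normalized bag-of-character-trigram vector."""
    vec = torch.zeros(len(vocab))
    for i in range(len(text) - 2):
        tri = text[i:i + 3]
        if tri in vocab:
            vec[vocab[tri]] += 1.0
    return vec / vec.sum().clamp(min=1.0)

# Toy corpus: texts labelled by author id (0 or 1).
corpus = [
    ("i would be of interest hearing about your experience", 0),
    ("in any case thank you for the detailed reply", 0),
    ("lol yeah that totally makes sense to me", 1),
    ("nah i dont think thats gonna work tbh", 1),
]

# Build the trigram vocabulary from the training texts.
vocab = {}
for text, _ in corpus:
    for i in range(len(text) - 2):
        vocab.setdefault(text[i:i + 3], len(vocab))

X = torch.stack([char_trigram_features(t, vocab) for t, _ in corpus])
y = torch.tensor([label for _, label in corpus])

model = nn.Linear(len(vocab), 2)          # logistic-regression-style classifier
optim = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):                      # brief training loop on the toy data
    optim.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optim.step()

# "Clearance" check: does a rewritten draft still get attributed to author 0?
draft = "i would be interested to hear about your experience"
probs = torch.softmax(model(char_trigram_features(draft, vocab)), dim=-1)
p0 = probs[0].item()
print(f"P(author 0) = {p0:.2f} -> {'reject' if p0 > 0.5 else 'clear'}")
```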

Thank you

AI Based Stylometry Defense might be the way to go?

Added a new wiki chapter just now:
AI Based Stylometry Defense

related:

But is Dolly 2.0 really Open Source? Hence, a ticket was created:

No acknowledgement after 2 weeks. Therefore, I submitted the contact form just now.

Is Dolly 2.0 really Open Source?

Please acknowledge on GitHub:
https://github.com/databrickslabs/dolly/issues/212

The term “Open Source” is often misused for AI. I (among others) wrote an article on that:
https://www.kicksecure.com/wiki/Artificial_intelligence

There is also a researcher saying that the Dolly RL weights are not published:
https://opening-up-chatgpt.github.io/

Sent.

> Thank you for contacting Databricks
> We will get in touch with you shortly.


Yes. I have not researched this in some time, but I was never able to use the pre-trained A4NT models to reproduce the results in their paper due to issues with the implementation. I was considering writing an issue report and contacting the original authors for comment, but that idea has been on pause for some time. Since there has been rapid advancement in text processing with AI language models, investigating the throughput and results of a newer CPU-bound language model would be a better task.
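A quick way to experiment with that, assuming a locally cached instruction-tuned model (google/flan-t5-base is only an example; any comparable model could be substituted), might look like the following with the transformers library. This is a sketch of the idea, not a vetted anonymization tool:

```python
# Hypothetical sketch: paraphrasing a paragraph with a small local model
# on CPU, as a rough replacement for the online translation chain.
from transformers import pipeline

rewriter = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",   # example model; swap in whatever is cached locally
    device=-1,                     # -1 = run on CPU
)

paragraph = (
    "I am a researcher interested in developing practical tools "
    "for improving operational security."
)
prompt = f"Paraphrase the following text in a neutral style: {paragraph}"

result = rewriter(prompt, max_new_tokens=128)
print(result[0]["generated_text"])
```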

A new fad in the AI ecosystem is to describe models as open source when the only part that is actually open source is the runtime, plus perhaps a pre-trained model blob that can be used locally. I do not like it either.

Thank you