Some thoughts on anti-stylometry research

Greetings

I am a researcher interested in developing practical tools for improving operational security (as opposed to proposals that get lost in academic papers or theory and never see the light of day).

One area I am interested in is stylometry and its countermeasures. I have developed a prototype tool that automatically iterates paragraphs of text through an online translation interface, chained through multiple languages, in an attempt to destroy any linguistic clues, since machine translation is rather 'lossy' with respect to style and sentiment.
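For reference, the core of the prototype is roughly the following Python sketch. The translate() function is a hypothetical placeholder for whatever online translation interface is actually used, and the language chain is arbitrary:

```python
# Rough sketch of the prototype's round-trip idea. translate() is a
# placeholder for whatever online translation interface is used; the
# chain of intermediate languages is arbitrary.
def translate(text: str, source: str, target: str) -> str:
    """Placeholder: send `text` to the translation interface and return the result."""
    raise NotImplementedError("wire this up to a translation backend of your choice")

def round_trip(paragraph: str, chain=("en", "de", "ja", "en")) -> str:
    """Push a paragraph through a chain of languages and back to the original language."""
    text = paragraph
    for source, target in zip(chain, chain[1:]):
        text = translate(text, source, target)
    return text
```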

That is not best practice, because it involves disclosing plaintext to an untrusted party, and that is the drive for a client-side solution.

I have seen the A4NT project (A4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation | USENIX) and some of its code (GitHub - rakshithShetty/A4NT-author-masking: Repository for author masking), but one issue with it is that there is no documentation on actually using it. There is an old saying about academic research projects (paraphrasing): 'You have the code and a paper, go figure it out', and that appears to be the situation here.

Maybe someone with enough time can correlate the steps described in the paper with the code and get it working that way.

In any case, I would be interested in hearing about your experience.

Regards

From what I have seen in the research, this technique is not adequate, and it also has the downside of producing unintelligible text. A better way is to have a user compare their new text to samples of older text they shared before and have the tool clear it as being different.
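A rough illustration of that comparison idea, as a hypothetical Python sketch (the character-trigram features and the 0.5 threshold are arbitrary choices for illustration, not taken from any existing tool):

```python
# Hypothetical sketch: profile older posts with character trigram counts
# and only "clear" a new draft if it is sufficiently dissimilar from them.
from collections import Counter
from math import sqrt

def trigrams(text: str) -> Counter:
    """Character trigram counts, a cheap and common stylometric feature."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def clears(new_text: str, old_samples: list[str], threshold: float = 0.5) -> bool:
    """Clear the draft only if it is not too similar to any earlier sample."""
    new_profile = trigrams(new_text)
    return all(cosine(new_profile, trigrams(old)) < threshold for old in old_samples)

# Example: compare a rewritten draft against two earlier posts.
old_posts = ["I am a researcher interested in practical tools.",
             "In any case, I would be interested in hearing about your experience."]
print(clears("Anyway, keen to hear how it went for you.", old_posts))
```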

If you have time to help us integrate an existing tool that does what I mentioned, let us know.

Yes, I also agree that attempted mutation via pre-existing machine-translation methods is not a good solution.

> A better way is to have a user compare their new text to samples of older text they shared before and have the tool clear it as being different.

The A4NT whitepaper, with its attribution classifier, appears to demonstrate the semantics of such differentiation and 'clearance'.

> If you have time to help us integrate an existing tool that does what I mentioned, let us know.

The A4NT whitepaper seems to be the most complete solution in terms of analyzing the trade-offs. The code is there, but with minimal documentation. Unfortunately, I also have minimal experience with PyTorch, so either I will have to learn it or, hopefully, others with experience can contribute as well.
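For anyone else looking at it: the core idea in A4NT is an author-attribution classifier used adversarially. The sketch below is not the A4NT architecture or its code, just a hypothetical, minimal PyTorch illustration of an attribution classifier used as a 'clearance' check on a toy corpus:

```python
# Hypothetical minimal author-attribution classifier in PyTorch.
# Not the A4NT model; just the general idea of a classifier that
# scores whether new text still "looks like" a known author.
import torch
import torch.nn as nn

def char_trigram_features(text, vocab):
    """Map text to a normalized bag-of-character-trigram vector."""
    vec = torch.zeros(len(vocab))
    for i in range(len(text) - 2):
        tri = text[i:i + 3]
        if tri in vocab:
            vec[vocab[tri]] += 1.0
    return vec / vec.sum().clamp(min=1.0)

# Toy corpus: texts labelled by author id (0 or 1).
corpus = [
    ("i would be of interest hearing about your experience", 0),
    ("in any case thank you for the detailed reply", 0),
    ("lol yeah that totally makes sense to me", 1),
    ("nah i dont think thats gonna work tbh", 1),
]

# Build the trigram vocabulary from the training texts.
vocab = {}
for text, _ in corpus:
    for i in range(len(text) - 2):
        vocab.setdefault(text[i:i + 3], len(vocab))

X = torch.stack([char_trigram_features(t, vocab) for t, _ in corpus])
y = torch.tensor([label for _, label in corpus])

model = nn.Linear(len(vocab), 2)          # logistic-regression-style classifier
optim = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):                      # brief training loop on the toy data
    optim.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optim.step()

# "Clearance" check: does a rewritten draft still get attributed to author 0?
draft = "i would be interested to hear about your experience"
probs = torch.softmax(model(char_trigram_features(draft, vocab)), dim=-1)
p0 = probs[0].item()
print(f"P(author 0) = {p0:.2f} -> {'reject' if p0 > 0.5 else 'clear'}")
```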

Thank you

AI Based Stylometry Defense might be the way to go?

Added a new wiki chapter just now:
AI Based Stylometry Defense

related:

But is Dolly 2.0 really Open Source? Hence, a ticket was created:

No acknowledgement after 2 weeks. Therefore, I submitted the contact form just now.

Is Dolly 2.0 really Open Source?

Please acknowledge on GitHub:
https://github.com/databrickslabs/dolly/issues/212

The term “Open Source” is often misused for AI. I (among others) wrote an article on that:
https://www.kicksecure.com/wiki/Artificial_intelligence

There is also a researcher saying that the Dolly RL weights are not published:
https://opening-up-chatgpt.github.io/

Sent.

> Thank you for contacting Databricks
> We will get in touch with you shortly.


Yes. I have not researched this in some time, but I was never able to use the pre-trained A4NT models to reproduce the results in their paper due to issues with the implementation. I was considering writing an issue report and contacting the original authors for comment, but that idea has been on pause for some time. Since there has been rapid advancement in text processing with AI language models, investigating the throughput and results of a newer CPU-bound language model would be a better task.
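A quick way to experiment with that, assuming a locally cached instruction-tuned model (google/flan-t5-base is only an example; any comparable model could be substituted), might look like the following with the transformers library. This is a sketch of the idea, not a vetted anonymization tool:

```python
# Hypothetical sketch: paraphrasing a paragraph with a small local model
# on CPU, as a rough replacement for the online translation chain.
from transformers import pipeline

rewriter = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",   # example model; swap in whatever is cached locally
    device=-1,                     # -1 = run on CPU
)

paragraph = (
    "I am a researcher interested in developing practical tools "
    "for improving operational security."
)
prompt = f"Paraphrase the following text in a neutral style: {paragraph}"

result = rewriter(prompt, max_new_tokens=128)
print(result[0]["generated_text"])
```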

A new fad in the AI ecosystem is to describe models as open source when the only part that is actually open source is the runtime, plus perhaps a pre-trained model blob that can be used locally. I do not like it either.

Thank you