Wednesday, October 18, 2023

Generative AI and Intellectual Property

Since the launch of ChatGPT, Generative Artificial Intelligence and Large Language Models have gained an extraordinary popularity and agency in a very short amount of time. As we are all playing around with the most approachable use cases to generate texts, images and videos, governments, global organizations and companies are busy developing the technology; and racing to harness the early mover's advantage this disruption will bring to all areas of our society.

I am not a specialist in the field and my musings might be erroneous here, but it feels that the term  Gen AI might be a little misguiding, since a lot of the technology relies on vast datasets that are used to assemble composite final products. Essentially, the creation aspect is more an assembly than a pure creation. One could object that every music sheet is just an assembly of notes and that creation is still there, even as the author is influenced by their taste and exposure to other authors... Fair enough, but in the case of document / text creation, it feels that the use of public information to synthetize a document is not necessarily novel.

In any case, I am an information worker, most times a labourer, sometimes an artisan but in any case I live from my intellectual property. I chose to make some of that intellectual property available license free here on this blog, while a larger part is sold in the form of reports, workshops, consulting work, etc... This work might or not be license-free but it is in always copyrighted, meaning that I hold the rights to the content and allow its distribution under specific covenants.

It strikes me that, as I see crawlers go through my blog and indexing the content I make publicly available, it serves two purposes at odds with each other. The first, allows my content to be discovered and to reach a larger audience, which benefits me in terms of notoriety and increased business. The second, more insidious not only indexes but mines my content to aggregate in LLMs so that it can be regurgitated and assembled by an AI. It could be extraordinarily difficult to apportion an AI's rendition of an aggregated document to its source, but it feels unfair that copyrighted content is not attributed.

I have playing with the idea of using LLM for creating content. Anyone can do that with prompts and some license-free software, but I am fascinated with the idea of an AI assistant that would be able to write like me, using my semantics and quirks and that I could train through reinforcement learning from human feedback. Again, this poses some issues. To be effective, this AI would have to have access to my dataset, the collection of intellectual property I have created over the years. This content is protected and is my livelihood, so I cannot part with it with a third party without strict conditions. That rules out free software that can reuse whatever content you give it to ingest.

With licensed software, I am still not sure the right mechanisms are in place for copyright and content protection and control, so that I can ensure that the content I feed to the LLM remains protected and accessible only to me, while the LLM can ingest other content from license free public domain to enrich the dataset.

Are other information workers worried that LLM/AI reuses their content without attribution? Is it time to have a conversation about Gen AI, digital rights management and copyright?

***This blog post was created organically without assistance from Gen AI, except from the picture created from Canva.com 

No comments: