AI training data and copyright – a legal minefield
Written by Tom Montague - Sales Director.
The UK is emerging as a leader in AI: home to more than 1,800 VC-backed AI start-ups, it now boasts 20 unicorns, including Synthesia, Wayve, and Stability AI.
Many of these start-ups are creating AI models that learn from existing data; for example, retail systems trained on past transactions, location or customer demographics.
However, often the data being used for this training is scraped or downloaded from the internet – and here, the legal situation can start to get a bit murky.
In the US, a number of legal cases are in progress, following complaints from copyright holders unhappy that their content is being used without payment. Music publishers including Concord and Universal, for example, are suing AI start-up Anthropic for using song lyrics without their consent.
In the UK, there's currently a lawsuit pending over the use of more than a million Getty Images pictures, which, says the firm, have been used for training by Stability AI without permission.
Meanwhile, British parenting forum Mumsnet says it's launching legal action against OpenAI over alleged data scraping of its site, which it says breaches both its terms of service and its copyright.
The reason for all these lawsuits, of course, is that this is largely uncharted territory. And while governments are sympathetic to copyright holders, they're also keen to support the highly profitable AI industry and legislate with a light touch.
Last summer, for example, the then UK government announced plans for a voluntary code of practice for AI and copyright, saying it wanted to promote and reward investment in creativity, while at the same time helping the UK to be a world leader in research and AI innovation.
However, in January, attempts to agree the code failed. This leaves the regulatory situation up in the air, with a Commons report published in April warning that the government was failing to protect content creators.
Since then, of course, a new Labour government has taken power, with the King's Speech including a commitment to 'seek to establish the appropriate legislation to place requirements on those working to develop the most powerful artificial intelligence models'.
So, what might this mean in practice, and how can AI start-ups make sure they're staying within the law?
Secretary of State for Science, Innovation, and Technology, Peter Kyle, has said the government is now planning to introduce a statutory code for AI developers, requiring them to carry out safety tests with independent oversight, share testing data with the government, and keep it posted on the capabilities of their models.
And while there are as yet no details, any legislation is likely to be at least similar to the provisions of the recently introduced EU AI Act. This specifies that any use of copyright-protected content requires the authorisation of the rights holder, unless certain, very limited, copyright exceptions apply.
And the EU is already flexing its muscles in this regard. Earlier this year, Meta was forced to put its AI training plans on hold following a request from the Irish Data Protection Commission (DPC), while a series of complaints has been raised against X over similar issues.
"When considering scraping data for AI training, the safest approach is to obtain and train on wholly owned or appropriately licensed works, data, and information to minimise risks of any legal claims," advises Rebecca Steer, a partner at law firm Charles Russell Speechlys.
"If you are looking to license the model to users, it will mean you are able to offer appropriate copyright indemnities and warranties to users, compared to other tools which offer little or no protection."
It's even possible to access a one-stop-shop for AI training data, with a number of companies now aggregating content into large collections for AI platforms to license; several recently formed a trade association, the Dataset Providers Alliance, offering deals on music, image, video, and other datasets.
And it's worth remembering that there are a number of sources of copyright-free or readily licensable data: Creative Commons, copyright-lapsed data, commercial libraries such as Shutterstock, other institutional repositories, and research-orientated datasets, for example.
The outcome of the cases against Meta and X, along with those brought by Getty and Mumsnet, may make things clearer. In the meantime, though, AI firms are advised to take a cautious approach.
If you’d like to understand more about what Howden could do for your business, visit howdenbroking.com/technology.