the spurious infinite
We analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT.) […]
The three biggest sites were patents.google.com (No. 1), which contains text from patents issued around the world; wikipedia.org (No. 2), the free online encyclopedia; and scribd.com (No. 3), a subscription-only digital library. Also high on the list: b-ok.org (No. 190), a notorious market for pirated e-books that has since been seized by the U.S. Justice Department. At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set. […]
Others raised significant privacy concerns. Two sites in the top 100, coloradovoters.info (No. 40) and flvoters.com (No. 73), had privately hosted copies of state voter registration databases. Though voter data is public, the models could use this personal information in unknown ways.
Business and industrial websites made up the biggest category (16 percent of categorized tokens), led by fool.com (No. 13), which provides investment advice. Not far behind were kickstarter.com (No. 25), which lets users crowdfund for creative projects, and further down the list, patreon.com (No. 2,398), which helps creators collect monthly fees from subscribers for exclusive content.
Kickstarter and Patreon may give the AI access to artists’ ideas and marketing copy, raising concerns the technology may copy this work in suggestions to users. […] The copyright symbol — which denotes a work registered as intellectual property — appears more than 200 million times in the C4 data set.
The News and Media category ranks third across categories. But half of the top 10 sites overall were news outlets: nytimes.com (No. 4), latimes.com (No. 6), theguardian.com (No. 7), forbes.com (No. 8), and huffpost.com (No. 9). (Washingtonpost.com, No. 11, was close behind.) […] RT.com (No. 65), the Russian state-backed propaganda site; breitbart.com (No. 159), a well-known source for far-right news and opinion; and vdare.com (No. 993), an anti-immigration site that has been associated with white supremacy. […] Among the top 20 religious sites, 14 were Christian, two were Jewish and one was Muslim, one was Mormon, one was Jehovah’s Witness, and one celebrated all religions. […]
The data set contained more than half a million personal blogs, representing 3.8 percent of categorized tokens. Publishing platform medium.com (No. 46) was the fifth largest technology site and hosts tens of thousands of blogs under its domain. Our tally includes blogs written on platforms like WordPress, Tumblr, Blogspot and LiveJournal.