
POPri Approach Uses Synthetic Data, User Feedback on Bridges-2 to Avoid Exposing User Data in Training Large AI

Large language models (LLMs) have revolutionized the ability of AI to provide fast, correct answers to questions. But the push to personalize LLMs to users’ preferences and data runs into a major roadblock. The best LLMs are generally too large to fit, let alone train, on users’ personal devices, where their private data resides. On the flip side, sending users’ private data from their personal devices to a central server poses clear privacy risks. A team led by Carnegie Mellon University and the Pittsburgh Supercomputing Center used PSC’s flagship Bridges-2 supercomputer and Delta at the National Center for Supercomputing Applications (NCSA) to develop a way to draw on the power of a central LLM without directly accessing any data from users’ devices.

WHY IT’S IMPORTANT

We’ve all been there. You say, “Siri, find the clip from Monty Python where they sing about Spam.” In return, you get lots of sites selling the tasty, if not exactly healthy, processed meat product by the case. Not quite what you wanted.

Products like Google’s Gboard and Apple’s mobile automatic speech recognition system use a process called federated learning (FL) to train themselves to respond correctly. The AI “lives” on your device, training itself on your data. It then sends its refinements, not your data, to a central AI, which combines the updates from many devices and sends the improvements back to all the little AIs on all the users’ devices.
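For the technically curious, here’s a minimal sketch of the federated-averaging idea behind FL, written for a toy linear model on made-up data rather than the actual Gboard or Apple systems:

```python
# Minimal federated-averaging sketch (illustrative only, not Gboard's or Apple's
# actual code). Each "device" refines a tiny linear model on its own data, and
# only the model weights, never the raw data, are sent to the server for averaging.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, steps=20):
    """One device refines the shared model on its private data."""
    w = weights.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of mean-squared error
        w -= lr * grad
    return w                                 # only the weights leave the device

# Fake private datasets for three simulated devices (stand-ins for user data).
devices = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(3)]

global_w = np.zeros(3)
for round_ in range(10):
    # Each device trains locally, starting from the current global model.
    local_ws = [local_update(global_w, X, y) for X, y in devices]
    # The server averages the updates and broadcasts the improved model.
    global_w = np.mean(local_ws, axis=0)

print("global model after federated averaging:", global_w)
```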

FL has two chief advantages. First, because the data stays on your device, it isn’t exposed to potential security breaches. Second, it’s fast: you don’t need to wait for your phone to talk with a computer on the other side of the country to get your answer.

But you can only fit so large an AI onto a mobile phone. Modern LLMs generate more accurate results, but they’re far too large to run on a phone, tablet, or laptop.

“Today, a lot of organizations want to train machine learning models from data that is stored on individual client devices. The challenge is that that data is often private … What we’ve been trying to do is understand whether you can instead … do all of your model training at the central server [without using private data].”

— Giulia Fanti, CMU

Giulia Fanti, the Angel Jordan Associate Professor of Electrical and Computer Engineering at CMU, wanted to combine the speed and security of FL with the predictive power of a publicly available LLM. To develop their new approach, called POPri (Preference Optimization for Private Client Data), her recently graduated Ph.D. student Charlie Hou teamed up with Yige Zhu, Daniel Lazar, and Mei-Yu Wang, a machine learning research scientist at PSC. The team’s tools for building and training their AI were PSC’s NSF-funded Bridges-2 and the Delta system at NCSA. They got access to these systems via the National AI Research Resource Pilot Program and the NSF’s ACCESS network of supercomputers.

HOW PSC HELPED

POPri (pronounced like potpourri) uses a fresh take on an algorithm called private evolution, which generates private synthetic data. The idea was that the LLM would generate fake-but-realistic synthetic data. User feedback, in the form of accepting or rejecting answers, would then improve that synthetic data without actually exposing the users’ private data. The improved synthetic data would train an LLM running on the Bridges-2 supercomputer, and that LLM would in turn teach a simpler AI on the user’s device to give similarly good answers.
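A toy sketch of that feedback loop, with the heavy machinery stripped away, might look like the following. Here the “LLM” is just a Gaussian number generator, each client’s private data is a hidden number, and the privacy safeguards and preference-optimization details of the real POPri method are left out:

```python
# Toy sketch of a POPri-style feedback loop (my own simplification, not the
# authors' implementation). The server never sees the clients' private numbers;
# it only sees which synthetic candidates the clients prefer.
import numpy as np

rng = np.random.default_rng(1)
private_client_data = rng.normal(loc=4.0, scale=0.5, size=8)  # never sent to the server

mu, sigma = 0.0, 2.0                # the server's synthetic-data "generator"
for round_ in range(30):
    candidates = rng.normal(mu, sigma, size=16)   # server generates synthetic samples
    # Each client votes for the candidate closest to its own private data;
    # only the votes (preferences), not the data, go back to the server.
    votes = [int(np.argmin(np.abs(candidates - d))) for d in private_client_data]
    preferred = candidates[votes]
    # The server nudges its generator toward the preferred synthetic samples,
    # standing in for the preference-optimization step run on Bridges-2.
    mu = 0.8 * mu + 0.2 * preferred.mean()
    sigma = max(0.5, 0.9 * sigma)

print(f"generator now centered near {mu:.2f} (clients' data centers near 4.0)")
```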

The computational power needed to make it all work would be hefty. The computer would need to work with and store several versions of the LLM, and the volume of data running to, from, and within the system would be large. The project played to two of Bridges-2’s strengths.

“This is an iterative process … we’ll try to explore how long the process should go on to reach the best result. [Because of that] we’re working on a model that … is about 32 gigabytes [in size and] we need to store multiple copies of the model. In terms of GPU compute time and also storage, these are both substantial.”

— Mei-Yu Wang, PSC

First, it would need many high-capability, AI-friendly graphics processing units (GPUs). With 360 late-model GPUs, compared with one or two in a high-end laptop, Bridges-2 had that need covered. Second, the LLMs would need to store and move a lot of synthetic data quickly, so that the computations didn’t get bogged down in bottlenecks. Bridges-2’s data management system, Ocean, can handle 15 petabytes of usable data, delivering up to 129 gigabytes per second (GB/s) reading data and 142 GB/s writing it. That’s 7,500 times as much storage as on that high-end laptop, and more than a thousand times the speed of a professional Internet download. The team made similar use of Delta.
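Assuming a hypothetical high-end laptop with a 2-terabyte drive and a 1-gigabit-per-second connection, the arithmetic behind those comparisons works out roughly like this:

```python
# Back-of-the-envelope check of the storage and bandwidth comparisons above,
# using an assumed 2 TB laptop drive and an assumed 1 Gbit/s connection.
ocean_capacity_tb = 15_000            # 15 petabytes, expressed in terabytes
laptop_drive_tb = 2                   # assumed laptop storage
print(ocean_capacity_tb / laptop_drive_tb)         # 7500x the storage

ocean_read_gbit_per_s = 129 * 8       # 129 GB/s expressed in gigabits per second
download_gbit_per_s = 1               # assumed fast "professional" download speed
print(ocean_read_gbit_per_s / download_gbit_per_s) # roughly 1000x the download speed
```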

When tested, the CMU/PSC team’s POPri approach improved on alternative methods. In particular, it closed the gap in correct answers between fully private methods, which use only pre-trained public LLMs, and non-private methods, which use the users’ data openly, by up to 68 percent. It also improved on state-of-the-art FL methods by 10 percent. The team reported their results at the 42nd International Conference on Machine Learning in Vancouver, Canada, in July 2025. They also received an honorable mention for the best paper award at a workshop titled “Will Synthetic Data Finally Solve the Data Access Problem?”, held at the Thirteenth International Conference on Learning Representations in April 2025.

“Access to these computational resources has been a game changer in terms of the research that we’re able to do. These experiments require significant storage and computational resources, and being able to access these resources has really been a huge enabler for our research.”

— Giulia Fanti, CMU

The team would like to continue refining POPri to achieve even greater accuracy. For example, while their current work focuses on English-language text, many organizations need to learn from private tabular or time-series data. Optimizing private synthetic data in those domains is a challenging but important problem for the future. The team is also interested in evaluating POPri’s performance across diverse settings and addressing challenges that may arise in real-world applications.