AP2D - ACM Papers Session 2D
The volume of scientific literature is growing exponentially, leading to underutilized discoveries, duplicated efforts, and limited cross-disciplinary collaboration. Retrieval-Augmented Generation (RAG) offers a way to assist scientists by improving the factuality of Large Language Models (LLMs) in processing this influx of information. However, scaling RAG to handle millions of articles introduces significant challenges, including the high computational costs associated with parsing documents and embedding scientific knowledge, as well as the algorithmic complexity of aligning these representations with the nuanced semantics of scientific content. To address these issues, we introduce HiPerRAG, a RAG workflow powered by high-performance computing (HPC) to index and retrieve knowledge from more than 3.6 million scientific articles. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm that enhances retrieval accuracy by using contrastive learning and late-interaction techniques. HiPerRAG delivers robust performance on existing scientific question answering (Q/A) benchmarks and two new benchmarks introduced in this work, achieving 90% accuracy on SciQ and 76% on PubMedQA, outperforming both domain-specific models like PubMedGPT and commercial LLMs such as GPT-4. Scaling to thousands of GPUs on the Polaris, Sunspot, and Frontier supercomputers, HiPerRAG delivers million-document-scale RAG workflows for unifying scientific knowledge and fostering interdisciplinary innovation.
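The abstract does not give ColTrast's implementation. As a rough illustration of how contrastive learning and late interaction can be combined for query-aware retrieval fine-tuning, the sketch below scores a query against passages with ColBERT-style MaxSim and trains with an InfoNCE-style loss. All function names, shapes, and hyperparameters are illustrative assumptions, not HiPerRAG's code.

```python
# Sketch: ColBERT-style late-interaction (MaxSim) scoring combined with an
# InfoNCE-style contrastive loss, one plausible reading of "contrastive
# learning + late interaction". Names and sizes are illustrative.
import torch
import torch.nn.functional as F

def late_interaction_score(q_tok: torch.Tensor, d_tok: torch.Tensor) -> torch.Tensor:
    """MaxSim: each query token is matched to its best document token.

    q_tok: (num_q_tokens, dim) L2-normalized query token embeddings
    d_tok: (num_d_tokens, dim) L2-normalized document token embeddings
    """
    sim = q_tok @ d_tok.T                  # (num_q_tokens, num_d_tokens)
    return sim.max(dim=1).values.sum()     # sum of per-query-token maxima

def contrastive_loss(q_tok, pos_tok, neg_toks, temperature=0.05):
    """InfoNCE over one positive passage and a list of negative passages."""
    scores = torch.stack(
        [late_interaction_score(q_tok, pos_tok)]
        + [late_interaction_score(q_tok, n) for n in neg_toks]
    ) / temperature
    # The positive passage sits at index 0 of the score vector.
    return F.cross_entropy(scores.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# Toy usage with random, normalized token embeddings.
dim = 128
q = F.normalize(torch.randn(32, dim), dim=-1)
pos = F.normalize(torch.randn(180, dim), dim=-1)
negs = [F.normalize(torch.randn(200, dim), dim=-1) for _ in range(7)]
print(contrastive_loss(q, pos, negs).item())
```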
The emergence of foundational models and generative artificial intelligence (GenAI) is poised to transform productivity in scientific computing, especially in code development, refactoring, and translating from one programming language to another. However, because the output of GenAI cannot be guaranteed to be correct, manual intervention remains necessary. Some of this intervention can be automated through task-specific tools, alongside additional methodologies for correctness verification and effective prompt development. We explored the application of GenAI in assisting with code translation, language interoperability, and codebase inspection within a legacy Fortran codebase used to simulate particle interactions at the Large Hadron Collider (LHC). In the process, we developed a tool, CodeScribe, which combines prompt engineering with user supervision to establish an efficient process for code conversion. In this paper, we demonstrate how CodeScribe assists in converting Fortran code to C++, generating Fortran-C APIs for integrating legacy systems with modern C++ libraries, and providing developer support for code organization and algorithm implementation. We also address the challenges of AI-driven code translation and highlight its benefits for enhancing productivity in scientific computing workflows.
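CodeScribe's interface is not shown in the abstract. The following Python sketch only illustrates the general pattern it describes: a translation prompt combined with an explicit user-review gate. `llm_complete`, the prompt wording, and the file handling are hypothetical placeholders, not CodeScribe's actual API.

```python
# Sketch of a prompt-engineering loop for Fortran-to-C++ conversion with a
# human review gate, in the spirit of the workflow described above.
from pathlib import Path

PROMPT_TEMPLATE = """You are translating legacy Fortran to modern C++17.
Preserve numerical behavior and argument order. Keep the routine name.

Fortran source:
{fortran_source}

Return only the translated C++ code."""

def llm_complete(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's client."""
    raise NotImplementedError

def translate_routine(fortran_path: Path, out_dir: Path) -> Path:
    """Draft a C++ translation of one Fortran routine."""
    prompt = PROMPT_TEMPLATE.format(fortran_source=fortran_path.read_text())
    draft = out_dir / (fortran_path.stem + ".cpp")
    draft.write_text(llm_complete(prompt))
    return draft  # the draft still goes through compile tests and human review

def review_and_accept(draft: Path) -> bool:
    """User-supervision step: show the draft and require explicit approval."""
    print(draft.read_text())
    return input(f"Accept {draft.name}? [y/N] ").strip().lower() == "y"
```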
Federated finetuning is essential for unlocking the knowledge embedded in pretrained Large Language Models (LLMs) when data is distributed across clients. Unlike single-institution finetuning, federated finetuning enables collaboration across decentralized datasets while preserving data privacy. Low-Rank Adaptation (LoRA) has gained popularity in Federated Learning (FL) because its reduced number of trainable parameters lowers the high computing cost of LLM training and improves energy efficiency. However, this approach assumes all clients have sufficient computing resources, which is often unrealistic due to the heterogeneity of resources across clients. While some clients may access powerful GPUs, others have limited or no such resources. Federated finetuning using synthetic data allows participation without local LLM training but introduces a performance gap compared to local updates. To address this, we propose a novel two-stage algorithm leveraging the storage and computing power of a strong server. In the first stage, resource-constrained clients generate synthetic data under the coordination of the strong server, which stores the data. In the second stage, the strong server uses this synthetic data on behalf of constrained clients to perform federated LoRA finetuning alongside clients with sufficient resources. This ensures participation from all clients. Experimental results demonstrate that incorporating local updates from even a small fraction of clients improves performance compared to using synthetic data for all clients. Additionally, we integrate the Gaussian mechanism in both stages to ensure client-level differential privacy.
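As a rough sketch of the aggregation step the second stage implies, the snippet below averages LoRA adapter updates, whether produced locally by resource-rich clients or by the strong server on behalf of constrained clients from their synthetic data, and applies clipping plus Gaussian noise for client-level differential privacy. The clip norm, noise multiplier, and tensor shapes are illustrative assumptions rather than the paper's settings.

```python
# Sketch: FedAvg over clipped LoRA adapter updates with Gaussian noise.
# Noise is scaled by clip_norm / n so the per-client sensitivity of the
# average is bounded after clipping. Hyperparameters are illustrative.
import torch

def clip_update(update: dict[str, torch.Tensor], clip_norm: float) -> dict[str, torch.Tensor]:
    """Scale one client's LoRA update so its global L2 norm is at most clip_norm."""
    total = torch.sqrt(sum(p.pow(2).sum() for p in update.values()))
    scale = torch.clamp(clip_norm / (total + 1e-12), max=1.0)
    return {name: p * scale for name, p in update.items()}

def aggregate_lora(updates: list[dict[str, torch.Tensor]],
                   clip_norm: float = 1.0,
                   noise_multiplier: float = 0.5) -> dict[str, torch.Tensor]:
    """Average clipped LoRA updates and add Gaussian noise (client-level DP)."""
    clipped = [clip_update(u, clip_norm) for u in updates]
    n = len(clipped)
    aggregated = {}
    for name in clipped[0]:
        mean_update = torch.stack([u[name] for u in clipped]).mean(dim=0)
        noise = torch.randn_like(mean_update) * noise_multiplier * clip_norm / n
        aggregated[name] = mean_update + noise
    return aggregated

# Toy usage: three clients, each contributing LoRA A/B matrices for one layer.
updates = [{"lora_A": torch.randn(8, 512), "lora_B": torch.randn(512, 8)} for _ in range(3)]
new_adapters = aggregate_lora(updates)
```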