The prowess of generative AI with text has brought immense value — from writing emails and answering questions to generating wedding speeches. AI models trained to deal with text, like large language models (LLMs), have powered this value and are only getting better at natural language.


However, challenges arise when we move beyond text and apply these models to structured, tabular data, which is essential for enterprise business operations. The gap stems partly from the availability of training data: text for training models is plentiful and often scraped from the internet, whereas tabular data, especially data spanning multiple linked tables, is scarce.

To bring AI advancements to the enterprise sector, researchers working on training and benchmarking the performance of these models in an enterprise setting need realistic tabular data. That’s why SAP developed “Sales Autocompletion Linked Business Tables” (SALT), a curated dataset that includes anonymized data from a customer’s enterprise resource planning (ERP) system.

SALT is specifically designed to support researchers working on AI models for real-world business contexts and can be accessed on Hugging Face and GitHub.

Challenges of getting and working with enterprise data

Providing the research community with realistic enterprise data like SALT has been challenging. Data privacy, confidentiality, and commercial interests make it difficult to obtain large, clean, high-quality enterprise datasets for training models and benchmarking them on specific use cases. As a result, there is a growing gap between the data researchers work with and what actual enterprise data looks like.

In addition to the problem of availability, enterprise data is complex. First, business data is usually stored in multiple interconnected tables. For example, a sales order entry may be linked to numerous tables, such as customer IDs connected to a supplier table containing address information. Second, tables are inherently heterogeneous in the data types they can contain. One field may be text, while another contains numerical or categorical values. Finally, business data frequently shows significant column imbalances: for example, a specific product category may make up 90 percent of all sales orders while others are rarely used.
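The three properties above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the table names, column names, and values are invented for illustration and are not taken from SALT or from any SAP schema. It shows a sales order resolved through linked tables, columns of mixed types, and a category column in which one value dominates.

```python
from collections import Counter

# Hypothetical toy tables; names and values are illustrative, not from SALT.
customers = {
    "C001": {"name": "Acme GmbH", "address_id": "A10"},
    "C002": {"name": "Globex AG", "address_id": "A11"},
}
addresses = {
    "A10": {"city": "Walldorf", "country": "DE"},
    "A11": {"city": "Zurich", "country": "CH"},
}

# Ten orders with mixed column types: integer ID, text key,
# categorical label, and a numeric value. Nine of ten are "standard".
sales_orders = [
    {
        "order_id": i,
        "customer_id": "C001" if i % 2 else "C002",
        "category": "standard" if i < 10 else "rush",
        "net_value": 100.0 + i,
    }
    for i in range(1, 11)
]


def resolve_order(order):
    """Follow foreign keys from an order to its customer and address rows."""
    customer = customers[order["customer_id"]]
    address = addresses[customer["address_id"]]
    return {**order, "customer_name": customer["name"], "city": address["city"]}


def category_shares(orders):
    """Return the fraction of orders that fall into each category."""
    counts = Counter(order["category"] for order in orders)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


if __name__ == "__main__":
    print(resolve_order(sales_orders[0]))
    print(category_shares(sales_orders))
```

In this toy setup, a model that always predicts the dominant category would already be right for nine of ten orders, which is why imbalanced columns make benchmarking harder than raw accuracy suggests.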

The best way to help researchers develop enterprise models for these challenges is to provide accurate enterprise data.

SALT dataset

Accurate enterprise data is a bottleneck in AI research. The SALT dataset alleviates this bottleneck by providing the research community with the first real ERP dataset. It uses actual industry data collected by an ERP system that records sales orders. It has been minimally processed to protect privacy.

“There is a gap between academia and industry in terms of data. It cannot be closed easily because of privacy,” says Tassilo Klein, one of the SAP researchers behind the dataset. “But we want to enable the research community to work on real problems, not just simulated problems.”

ERP systems help organizations manage core business operations like finance and spending. With millions of entries and extensive, interconnected relational tables focused on sales, the SALT dataset replicates customer interactions in an ERP system. SALT’s realistic enterprise data means it is a perfect basis for helping models understand the characteristics of business data and validate their performance through benchmarking. It also should help researchers develop better foundation models for linked business data.

Getting this right will advance enterprise automation, as many enterprise business processes are heavily centered around data in structured tabular formats. Even though this data plays a crucial role in day-to-day enterprise activities, the generative AI revolution has yet to tap into it.

“SALT is a first step to providing researchers with authentic representative industry data that gives a glimpse into actual enterprise data; for now, we are starting with just one customer and use case,” shares Johannes Hoffart, CTO of Business AI at SAP. “However, we plan to publish more datasets that cover a diverse set of customers and use cases that, along with SALT, can serve as a basis for pre-training, adapting, as well as benchmarking models.”

Collaboration with academic institutions is also a motivation for publishing this data.

“At SAP, we hope to collaborate with academic partners who usually can only publish their results on open repositories,” Klein says. “Another hope for the dataset is encouraging more people to explore and validate new methods that help foundation models better deal with tabular enterprise data.”

What SAP is doing

Alongside its investment in the open research community with SALT, SAP is building SAP Foundation Model to handle enterprise tabular data. This table-native AI model aims to accelerate time-to-value for predictive tasks on tabular data, offering a model that can work with tabular data out-of-the-box with little or no additional training data. The PORTAL paper, published alongside SALT, provides a first glance at how this model could look.

Knowledge graphs are critical here. They work by exposing metadata — the who, what, and when of data — making relationships between information accessible. This provides a structured, interconnected representation of the data that AI models can easily understand and utilize. With the help of SAP Knowledge Graph, SAP Foundation Model can be scaled and adapted to a wide array of diverse use cases with some lightweight fine-tuning.
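As a rough illustration of how metadata can make relationships between tables accessible, the sketch below stores hypothetical metadata as subject-predicate-object triples and walks the reference links between tables. It is not SAP Knowledge Graph or its API; the triples, the `references` predicate, and the `linked_tables` helper are all assumptions made for this example.

```python
from collections import deque

# Hypothetical metadata triples (subject, predicate, object).
# "Table.column references Table.column" encodes a foreign-key link.
triples = [
    ("SalesOrder.customer_id", "references", "Customer.id"),
    ("Customer.address_id", "references", "Address.id"),
    ("SalesOrder.net_value", "has_type", "decimal"),
    ("SalesOrder.category", "has_type", "categorical"),
]


def linked_tables(table, metadata):
    """Return tables reachable from `table` by following 'references' links."""
    # Build a table-level adjacency map from the column-level triples.
    edges = {}
    for subject, predicate, obj in metadata:
        if predicate == "references":
            source = subject.split(".")[0]
            target = obj.split(".")[0]
            edges.setdefault(source, set()).add(target)

    # Breadth-first search over the table graph.
    seen = {table}
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for neighbor in edges.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(table)
    return sorted(seen)


if __name__ == "__main__":
    print(linked_tables("SalesOrder", triples))
```

With the relationships made explicit in this way, a program can discover which tables belong together without hard-coding the schema.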
