Analysis: There are several places within an organization's IT infrastructure that could serve as a single source of proprietary data for large language models and AI agents doing RAG (retrieval-augmented generation): backup stores, data lakes, data management catalogs, and storage arrays. Let's sketch the background and then look at how they compare.
LLMs and agents can scan all kinds of unstructured information in text, numeric, audio, image, and video formats in response to natural language inputs, and generate natural language outputs. However, they are initially trained on general data sources and will give better responses to an organization's users if they also have access to the organization's own data. But that data is spread across many disparate systems and formats.
It lives, for example, in datacenters (mainframes and x86 servers), edge sites, public cloud instances, SaaS application stores, databases, data warehouses, data lakes, backups on disk and tape (on-premises and in the public cloud), and archives. It can be accessed via structured (block) or unstructured (file and object) protocols, in many subsidiary formats: Word, Excel, PDF, MPEG, and so on.
There are four broad approaches to making all these different data sources available for RAG. First, you could go to each individual source every time, which would require you to know what the sources are, where they are, and what they contain. This approach operates at the level of individual storage arrays or cloud storage instances.
Second, you could go to existing agglomerated data sources, meaning databases, data warehouses, and data lakes. Third, you could go to the backup stores. Fourth, you could use a data management application that knows about your data.
The second, third, and fourth options already have metadata catalogs describing their content types and locations, which makes life a lot easier for the RAG data source hunter. All else being equal, accessing one location to find, filter, select, extract, and move RAG data is better than going to myriad places. It makes AI data pipeline construction much simpler.
A couple of other observations. First, the more of your proprietary data the single place contains, the better, because you have to make fewer exception arrangements. Second, LLMs and agents need unstructured data to be vectorized for their semantic searches, and those vectors have to be produced and stored somewhere. Any central RAG data source facility therefore needs to support vectorization and vector storage.
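To make that concrete, here is a minimal sketch of what vectorization and semantic search involve, assuming the sentence-transformers library and the all-MiniLM-L6-v2 embedding model as stand-ins for whatever your chosen platform actually uses, with a simple in-memory store rather than a real vector database:

```python
# Minimal sketch: vectorize unstructured text and run a semantic similarity search.
# Assumptions: sentence-transformers with the all-MiniLM-L6-v2 model, and an
# in-memory NumPy array standing in for a persistent vector store.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Unstructured text pulled from your chosen RAG source (files, objects, backups...)
docs = [
    "Q3 revenue grew 12 percent, driven by the subscription business.",
    "The disaster recovery runbook covers failover to the secondary site.",
    "Employee expense policy: flights over $500 require manager approval.",
]

# Vectorize: each document becomes a fixed-length embedding (384 dimensions here).
doc_vecs = model.encode(docs, normalize_embeddings=True)

def search(query: str, top_k: int = 2):
    """Return the documents most semantically similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                      # cosine similarity, since vectors are normalized
    best = np.argsort(scores)[::-1][:top_k]
    return [(docs[i], float(scores[i])) for i in best]

print(search("what is our travel approval threshold?"))
```

In a production pipeline the in-memory array is replaced by a persistent vector store, and the retrieved chunks are passed to the LLM as context alongside the user's prompt.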
That’s the background. Let’s take a gander at the four main RAG data sources.
Storage arrays
Virtually every storage array vendor we know of, hardware and software-defined, file and object, is building in some kind of GenAI support. Cloudian, DDN, Dell, HPE, NetApp, Scality, StorONE, VAST Data, WEKA, and others are all piling in.
Arrays (or software-defined storage) with a fabric connecting on-premises arrays and public cloud instances under a global namespace will have an advantage; their reach is obviously greater than that of non-fabric arrays. The same is true for cloud file services suppliers such as Box, CTERA, Egnyte, Nasuni, and Panzura.
However, such vendors can only supply RAG data stored on their own systems, unless they have connectors giving them wider access. An agent granted access to a Dell, NetApp, DDN, or HPE array, say, won't be able to see data on another supplier's array in that organization through that array's RAG lens.
Database/warehouse/lake/lakehouses
Databases, data warehouses, data lakes, and lakehouses store progressively more information, and generally in more formats, as we move from databases to data warehouses, then to data lakes and lakehouses. At the database end of this spectrum, specialized vector databases are appearing from suppliers such as Pinecone and Zilliz (Milvus). They claim to offer best-of-breed vector storage, filtering, and extraction, and support for AI pipelines.
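For a flavor of what such a vector database looks like in use, here is a hedged sketch using the Pinecone Python client; the index name, dimension, cloud region, metadata fields, and embeddings are illustrative assumptions, and exact signatures vary by client version:

```python
# Hedged sketch: upsert and query embeddings in a managed vector database.
# The index name ("rag-docs"), dimension (384), cloud/region, metadata, and the
# placeholder embeddings are illustrative assumptions, not a real deployment.
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# One-off: create an index sized to match your embedding model's output.
pc.create_index(
    name="rag-docs",
    dimension=384,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("rag-docs")

# Upsert document vectors with metadata that can be used to filter at query time.
index.upsert(vectors=[
    {"id": "doc-1", "values": [0.01] * 384, "metadata": {"source": "backup", "dept": "finance"}},
    {"id": "doc-2", "values": [0.02] * 384, "metadata": {"source": "nas", "dept": "hr"}},
])

# Query: nearest neighbours to a query embedding, filtered by metadata.
results = index.query(
    vector=[0.015] * 384,
    top_k=3,
    filter={"dept": "finance"},
    include_metadata=True,
)
print(results)
```

The metadata filter is the part that matters most for RAG, since it lets retrieval respect source or department boundaries before results are handed to the LLM.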
Other databases aim to be multi-model, like SingleStore. Its SingleStoreDB is a performant, distributed, relational SQL database with operational, analytical, and vector data support, integration with Apache Iceberg, Snowflake, BigQuery, Databricks, and Redshift, and a Flow ingest feature.
Data warehouses are rapidly becoming AI data feed warehouses; witness Snowflake and its AI Data Cloud. Data lakes and lakehouses are also very much GenAI-aware and are rapidly developing features to support it. As an example, consider Databricks, which recently added a Lakebase Postgres database layer to its lakehouse, enabling AI apps and agents to run analytics on operational data within the Databricks environment. It also introduced Agent Bricks, a tool for automated AI agent development.
These are all excellent products, and connectors exist to bring them data from multiple different sources or to let them access external table data. Their operational and admin environment becomes more complicated, however, as they extend their reach to external tables or use connectors to bring in data from distant sources.
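As an illustration of using a warehouse as a RAG feed, here is a hedged sketch that pulls text rows out of Snowflake with the snowflake-connector-python package and prepares them for the embedding step; the account, credentials, database, and table names are all illustrative assumptions:

```python
# Hedged sketch: use a data warehouse as a RAG feed by selecting text rows and
# turning them into chunks for the embedding/vector-store step.
# Account, credentials, warehouse, database, and table names are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="YOUR_ACCOUNT",
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    warehouse="ANALYTICS_WH",
    database="CORP_DOCS",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # Hypothetical table of support tickets to be vectorized for retrieval.
    cur.execute("SELECT ticket_id, subject, body FROM support_tickets LIMIT 1000")
    rows = cur.fetchall()
finally:
    conn.close()

# Turn each row into a text chunk ready for embedding.
chunks = [f"{subject}\n{body}" for (_ticket_id, subject, body) in rows]
print(f"{len(chunks)} chunks ready for vectorization")
```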
Backup stores
A backup vault stores data copied from multiple places within your organization, so it can function as a single source for LLMs and agents needing access to your proprietary information. It's not real-time information, but it can be very close to that, and it obviously holds troves of historical information.
The backup vault's storage media is an important consideration. Tape is obviously a no-no due to lengthy data access times. Disk is better, while an all-flash vault is probably the fastest.
Cohesity, Commvault, Rubrik, and Veeam are well aware of the strength they possess as a vast potential RAG data store, and are building out features to capitalize on it. Cohesity has its Gaia initiative. Rubrik recently acquired Predibase for agentic AI development functionality.
An obvious caveat is that the backup vendors can only serve LLMs and agents data that they have backed up. Anything else is invisible to them. This could encourage their customers to standardize on one backup supplier across the organization.
Data managers
Data managers such as Arcitecta, Datadobi, Data Dynamics, Hammerspace, and Komprise almost never act as data sources. We say almost never because Hammerspace is edging towards that functionality with the Open Flash Platform initiative. The data managers manage data stored on multiple storage arrays and in public cloud storage instances. This partially reflects a hierarchical life cycle management background common to several of them.
The data managers don't store data, but they do index and catalog it; they know what data their customers have, and they often have data-moving capability, coming in some cases from a migration background. That means they can construct interface processes to AI data pipelines and feed data up them. Komprise is highly active here. Datadobi, with its StorageMap, policy-driven workflows, and data-moving engine, has a great base on which to build AI model-specific functionality.
Arcitecta's Mediaflux product can be used to curate high-quality datasets from vast volumes of unstructured data, making it easier to feed data into LLMs and other AI systems. We expect more AI-specific functionality to emerge here.
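To show the kind of interface process meant above, here is an entirely hypothetical sketch of how a data manager's index could drive a RAG pipeline: it reads a catalog export, assumed here to be a CSV with path and modified-date columns, filters to the documents worth vectorizing, and hands the file list to the embedding step. No vendor's actual export format or API is implied.

```python
# Entirely hypothetical sketch: use a data manager's catalog export to select
# files for a RAG pipeline. The CSV layout (path, modified columns) is assumed.
import csv
from datetime import datetime, timedelta

CUTOFF = datetime.now() - timedelta(days=365)        # skip anything older than a year
WANTED_TYPES = {".pdf", ".docx", ".txt", ".md"}      # text-bearing formats worth embedding

def select_for_rag(catalog_csv: str) -> list[str]:
    """Return paths of recent, text-bearing files listed in a catalog export."""
    selected = []
    with open(catalog_csv, newline="") as f:
        for row in csv.DictReader(f):
            modified = datetime.fromisoformat(row["modified"])
            suffix = "." + row["path"].rsplit(".", 1)[-1].lower()
            if modified >= CUTOFF and suffix in WANTED_TYPES:
                selected.append(row["path"])
    return selected

# The selected paths then feed the chunk-and-embed stage of the pipeline.
# print(select_for_rag("catalog_export.csv"))
```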
To expand their reach and give themselves more relevance here, the data managers could usefully partner with the database, warehouse, lake, and lakehouse suppliers. Hammerspace's partnership with Snowflake is an example of this.
Other data silos
SaaS apps such as Salesforce are another data source for RAG. They cannot see outside their own environment and, absent explicit connectors, other data sources cannot see into theirs, with one exception. Their own environments, Microsoft's for example, can of course be large.
The exception is backup. If a SaaS app has its customer data backed up by a Cohesity, Commvault, Druva, HYCU, or the like, then that data is inside the backup vendor's purview.
Summing up
There is no silver bullet here, no instant fix, no one all-embracing GenAI RAG data source. You can only search out the best available GenAI data source or sources for your organization, considering your on-premises and public cloud IT data estate. All the suppliers mentioned are developing features, building partnerships, and even acquiring companies that own relevant technology. They all have their merits and will have more next month, next year, and so on.
Wherever you are starting from, you can place candidate suppliers into one of the four categories: storage, database/warehouse/lake/lakehouse, backup store, and data management. That will give you a quick overall view of their scope. You can then decide which route or routes to take to give the LLMs and agents you use the most efficient access to the proprietary data they need.