Better unstructured data management is the reason Komprise was founded in 2014 by CEO Kumar Goswami along with president and COO Krishna Subramanian and CTO Michael Peercy. At that time, file populations numbering in the millions were appearing in large enterprises and, since then, have risen to the billions. Objects stored in buckets are numbered in the trillions at the hyperscale public cloud providers. Komprise works part of its data management magic by using and enriching metadata about files and objects.
For example, media files can have added metadata to describe their contents. In the last few years, generative AI’s large language models have come to rely on vector embeddings to perform semantic search, and such vectors are generated from the content of unstructured data. Are vectors a kind of metadata? We explored these topics with Goswami in an interview.
Blocks & Files: I could argue that the tokens and vector embeddings generated from a data item are metadata. What do you think about this idea?

Kumar Goswami: Metadata and vector embeddings are related but complementary. Vector embeddings are a computer-understandable representation of file contents (“the what”), while metadata is valuable information about the file that can go well beyond file contents (“the why”), so you need both. Metadata is usually more concise than vector embeddings, and putting the entire file contents into metadata can be inefficient. Also, there could be data governance issues with running AI on all your data via embeddings.
For example, say you want a chatbot to answer questions based on the most recent product features, but you want it to use only public-facing documents and not confidential internal documents. You should use metadata to exclude internal documents and non-final versions, and run the vector embeddings and AI on just the right files. We are focusing on gathering and globally managing metadata to enrich, inform, and narrow down data, not to capture everything that can be gleaned from it.
We want to empower other tools and processes to consume and process the data as a whole. For example, you can enforce AI data governance and improve AI data quality by using Komprise to cull the files fed to Nvidia NeMo for embedding and running inferencing.
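The curation step Goswami describes, filtering on metadata before any embedding is computed, can be sketched in a few lines of Python. This is a minimal illustration and not Komprise's API: the `FileRecord` fields (`visibility`, `status`) and the `select_for_embedding` helper are hypothetical names chosen for the example, and the embedding call itself is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical metadata record for one file (field names are assumptions)."""
    path: str
    visibility: str  # e.g. "public" or "internal"
    status: str      # e.g. "final" or "draft"


def select_for_embedding(records):
    """Keep only public, final documents; everything else never reaches the AI."""
    return [r for r in records if r.visibility == "public" and r.status == "final"]


records = [
    FileRecord("docs/features-v3.pdf", "public", "final"),
    FileRecord("docs/features-v4-draft.pdf", "public", "draft"),
    FileRecord("internal/roadmap.docx", "internal", "final"),
]

curated = select_for_embedding(records)
print([r.path for r in curated])  # ['docs/features-v3.pdf']
```

Only the curated list would then be handed to whatever embedding service is in use, so the internal and draft documents are excluded before any AI compute is spent on them.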

Blocks & Files: Komprise says new tools can automatically analyze file contents and generate semantic tags at scale. What are semantic tags? Are they metadata generated from a file’s contents? If so, then how do these semantic tags differ from vector embeddings?
Kumar Goswami: Vector embeddings are used to help AI understand the meanings of words in context while metadata provides semantic context for which files are relevant. For example, vector embeddings may help AI understand that the word “award” in the context of a research grant paper means getting a funding award and not winning a trophy. Metadata can be used to cull and curate all the documents related to a specific research topic by a specific researcher in a specific time frame to send to an AI agent that is helping write a grant application. You can argue that both are semantic contexts, but for different purposes, and metadata is broader than what is in the file itself.
Blocks & Files: What tools exist that automate finding and analyzing metadata?
Kumar Goswami: You need to not only index metadata across different storage and cloud environments but also act on it at scale. Komprise does both as our analysis extracts both system metadata and extended metadata such as sensitive data information into a global file index. This index retains the knowledge no matter where your data lives, and it does so without changing the original files. Komprise Deep Analytics helps you query and filter data based on this index and Komprise Smart Data Workflows allows you to search and feed the right data to the right AI process and retain its outputs as additional metadata.
That’s the neat thing about metadata and AI: it is not a one-and-done process like traditional ETL. Instead, you need an ongoing workflow solution to find the right data, get it to the right compute, run the compute either locally or in the cloud, and then repeat this process again. Our customers have indexed and mobilized over an exabyte of data using Komprise. You can use any AI or vector embedding or processor to enrich metadata further on your data in Komprise workflows. A great example of this is our customer Duquesne University.
Blocks & Files: What AI tools are now available to extract pertinent information hidden in files and turn it into useful metadata that adds structure and context? How is the synthesis carried out?
Kumar Goswami: Anything that looks at file contents and generates outputs can be used via APIs in Komprise to enrich metadata. You can use cloud-based services like Azure AI Speech to inspect audio or Salesforce Einstein to find particular purchase orders in your CRM, and then have Komprise tag the files. That is the beauty of iterative workflows. You can use any process or tool to distill relevant metadata once you have a systematic way to manage the workflow.
Blocks & Files: I understand Komprise thinks that automatic metadata from storage systems, while useful for basic operations, is just the start of a strategic metadata management program. The real business value comes from enriching this foundation with metadata that precisely defines data so it can be easily searched and moved as needed to AI tools or other locations as required. What metadata enriches the automatic metadata from storage systems? How is it generated? How is it stored and indexed?
Kumar Goswami: There are many types of additional metadata, some of which are shown below. You could have users manually apply additional tags based on their knowledge. You can also systematically automate applying tags at scale based on the artifacts from other processes, as we have explained in prior answers. Enriched metadata becomes part of the data stored and indexed by an unstructured data management system. To be effective, such systems must be able to handle the scale of billions of metadata tags and persist these tags wherever the data lives and moves. Komprise can do this today.
Contextual metadata: Project identifiers, geographical tags, departmental associations, and business context that give meaning beyond technical properties. Some of this information can be extracted from applications, some from headers in files, and some via APIs from related applications (like getting the account identifier for a proposal from the CRM system).
Sensitivity metadata: PII, intellectual property, regulated data types, and security classifications. This requires specialized tools to uncover and classify, as it involves analyzing file contents rather than just properties.
User-based metadata: Manual tags, collaborative annotations and crowd-sourced insights that add human intelligence to data classification. While powerful, this approach faces scalability challenges as data volumes explode.
AI-generated metadata: The newest and most transformative category. AI analyzes file contents and automatically generates contextual tags and classification insights at scale.
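One way to picture how these categories sit alongside the automatic system metadata is a single index entry per file. The sketch below is an assumption for illustration only: the `EnrichedMetadata` class and its field names are hypothetical and do not reflect how Komprise actually stores or indexes tags.

```python
from dataclasses import dataclass, field


@dataclass
class EnrichedMetadata:
    """Hypothetical index entry; the original file is never modified."""
    path: str
    system: dict = field(default_factory=dict)      # size, owner, timestamps
    contextual: dict = field(default_factory=dict)  # project, region, department
    sensitivity: set = field(default_factory=set)   # e.g. {"PII"}
    user_tags: set = field(default_factory=set)     # manual annotations
    ai_tags: set = field(default_factory=set)       # tags produced by an AI tool

    def all_tags(self):
        """Union of every tag category, useful for simple searches."""
        return self.sensitivity | self.user_tags | self.ai_tags


entry = EnrichedMetadata(path="/share/grants/proposal.docx")
entry.contextual["project"] = "example-project"
entry.sensitivity.add("PII")
entry.user_tags.add("reviewed")
entry.ai_tags.add("research-funding")

print(sorted(entry.all_tags()))  # ['PII', 'research-funding', 'reviewed']
```

Keeping the categories separate in the record makes it possible to tell a human-applied tag from an AI-generated one later, which matters when the tags drive governance decisions.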
Blocks & Files: How can Komprise automatically identify and classify data based on business value, access patterns, and project requirements? That way, you can store data in the right place at the right time without wasting precious resources.
Kumar Goswami: Komprise offers automatic identification of sensitive data in the product today, whether that is PII or a keyword/regex search for a custom query. We can work with any third-party AI tool to scan for different data types and uniquely identify data contents with tags that departmental users and data scientists need for projects. Culling and feeding the right data to AI is very important, regardless of whether the AI runs locally or in the cloud, for three key reasons: a) it can be very costly to copy a lot of unnecessary data across environments; b) you don’t want to run expensive AI compute on irrelevant data or repeatedly on unchanged data; and c) most importantly, feeding the wrong data to AI could create data leakage and inaccurate results.
With AI, more isn’t always better. Komprise uses policy-driven workflows to manage this entire lifecycle, from searching for the right data and moving it to the right place to extracting relevant outputs as metadata tags and then deleting or tiering off the data when done. Our automation can do this as new data arrives, eliminating the manual overhead of right-sizing AI. Our customers find they can cut their AI storage and compute costs by 85 percent or more using Komprise.
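The point about not running AI compute repeatedly on unchanged data can be illustrated with a content-hash check. This is one common technique, shown here as an assumption; the interview does not say how Komprise detects unchanged files, and the `files_needing_processing` helper is a hypothetical name.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def files_needing_processing(files, seen_digests):
    """Return paths whose content is new or changed; update seen_digests in place."""
    pending = []
    for path, data in files.items():
        digest = content_digest(data)
        if seen_digests.get(path) != digest:
            pending.append(path)
            seen_digests[path] = digest
    return pending


seen = {}
first_run = files_needing_processing({"a.txt": b"v1", "b.txt": b"v1"}, seen)
second_run = files_needing_processing({"a.txt": b"v1", "b.txt": b"v2"}, seen)
print(first_run)   # ['a.txt', 'b.txt']
print(second_run)  # ['b.txt']
```

On the second pass only `b.txt` is queued, because its contents changed while `a.txt` did not.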
Blocks & Files: How can Komprise help data scientists quickly discover relevant datasets, understand data lineage and ensure compliance with governance requirements?
Kumar Goswami: We’ve covered the first point above with metadata enrichment enabling rapid search and curation of precise data sets. As Komprise moves the data to AI, it maintains an audit of what information was sent, and it tracks the lineage of where the data has been moved and where it came from. Companies can get an audit trail for data governance purposes. Increasingly, data governance is not just about complying with government regulations but is also a corporate priority to prevent leakage of corporate information.
Komprise also offers sensitive data detection and mitigation, orphaned and duplicate data search and deletion, and the ability to automate data management policies for different use cases, such as cold data tiering to immutable storage for ransomware protection, or ensuring that data that must adhere to regulations such as HIPAA and GDPR is stored and protected appropriately. Two examples are setting up a Deep Analytics query to identify these protected data sets (PII, PHI), and automatically acting on them if they are not handled properly by confining them, sending them to compliant storage, and deleting them per regulatory requirement timelines.
Blocks & Files: Komprise says organizations need to rapidly identify and protect their most critical data assets from ransomware. How can Komprise help them do that, identify their most critical data assets? What does “most critical” mean?
Kumar Goswami: Most organizations struggle to protect unstructured data against ransomware attacks because keeping many copies of petabytes can be prohibitively expensive. So, we help our customers right-size their ransomware defense by helping them tier cold data to immutable storage where it is protected for a fraction of the cost, while enabling them to use more active ransomware defense and recovery for their active and hence more business-critical data.
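The split between cold and active data that this answer relies on can be sketched as a simple policy on last access time. The 365-day threshold, the `split_by_activity` helper, and the example paths are assumptions made for illustration; real policies would be set per organization.

```python
from datetime import datetime, timedelta


def split_by_activity(last_access, now, cold_after_days=365):
    """Split paths into (cold, active) lists based on last access time."""
    cutoff = now - timedelta(days=cold_after_days)
    cold = sorted(p for p, t in last_access.items() if t < cutoff)
    active = sorted(p for p, t in last_access.items() if t >= cutoff)
    return cold, active


now = datetime(2025, 1, 1)
last_access = {
    "/share/archive/2019-report.pdf": datetime(2019, 6, 1),
    "/share/projects/current-plan.docx": datetime(2024, 12, 15),
}
cold, active = split_by_activity(last_access, now)
print(cold)    # ['/share/archive/2019-report.pdf']
print(active)  # ['/share/projects/current-plan.docx']
```

In this picture, the cold list is the candidate set for tiering to immutable storage, while the active list stays under the more expensive active defense and recovery tooling.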
Blocks & Files: Komprise says sensitive data detection through metadata tagging for “PII” and other keywords helps find protected data that may be stored in non-compliant locations and secure it properly against cyberattacks. Can Komprise automate this process?
Kumar Goswami: Yes! You can select the file shares and directories to search, and then Komprise will scan them for any data that is PII, such as names, birth dates, user IDs, driver’s license numbers, Social Security numbers, credit card numbers, and addresses. You can also use regex/keyword search to find IP data or other data deemed sensitive to your organization that doesn’t fit any standard definitions; this could include EmployeeID or PatientID, for example. You can then use a Smart Data Workflow to take additional actions, such as confining the data sets for manual review for legal hold or deletion, and/or automatically moving them to secure storage.
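A regex-based scan of the kind mentioned here might look like the following sketch. The patterns are deliberately simple and are assumptions: the `EMP-` identifier format is hypothetical, the SSN pattern only matches the common dashed layout, and production-grade detection would need validation beyond pattern matching.

```python
import re

# Tag name -> compiled pattern. Both patterns are illustrative assumptions.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EmployeeID": re.compile(r"\bEMP-\d{6}\b"),  # hypothetical custom format
}


def tag_sensitive(text):
    """Return the set of tag names whose pattern appears in the text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}


print(tag_sensitive("SSN 123-45-6789 on file"))  # {'SSN'}
print(tag_sensitive("Badge EMP-004211 issued"))  # {'EmployeeID'}
print(tag_sensitive("Quarterly results"))        # set()
```

The returned tag names are what a workflow could then act on, for example by confining or moving every file tagged `SSN`.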
Blocks & Files: Komprise says the ability to scan file shares across vendors and automatically tag sensitive data types for appropriate action is a game changer. Too often, data gets copied and/or moved to locations where it is not adequately protected based on policies and regulations. How can Komprise help here? By automating the process and making it policy-driven?
Kumar Goswami: Yes, I believe we have answered this question. IT can set this up as an automated policy to run daily, weekly, or at whatever frequency is desired. Storage leaders are concerned about the risk of sensitive data lurking where it shouldn’t be and getting exposed inadvertently to AI. Komprise helps them identify and mitigate this risk.
Bootnote
You can download a Komprise technical overview here.