Data and AI for Non-Tech Managers
This article addresses the Data hurdle in my Moving Forward with AI article.
Eager to unlock the transformative potential of AI/ML technologies? Start with your data.
Many proofs of concept struggle to scale, often because of data challenges. The journey from raw data to actionable insights is complex, especially when adopting artificial intelligence (AI) and machine learning (ML) technologies to deliver real-world impact.
Success hinges on integrated data and AI strategies that support business objectives.
Data is inherently technical and its landscape is constantly evolving. This primer aims to help non-tech managers and decision-makers digest the core data concepts most relevant to AI/ML applications, empowering you to collaborate effectively with IT and data teams to make informed decisions together and drive results with confidence.
While technology, data quality, and data governance are often discussed separately, we believe they should be considered together to deliver a sustainable solution.
We will cover:
Storage and Infrastructure:
1. Evolving Data Management and Emerging “DataCo”
2. Computing Infrastructure: Cloud, Hybrid, and Edge Computing
3. Data Storage: Warehouses, Lakes, Open Table Formats, and Lakehouses
Data Pipelines and AI Readiness:
4. How Data Flows: From Data Ingestion, Pipeline, Integration to Analytics
5. Utilizing Data: Data Cataloging, Mesh, and Data Security
6. Specific to AI: Features and Enterprise Feature Stores
7. Emerging AI-Enabling Technologies: Data Fabric, Semantic AI, AutoML, and Zero-Copy
Data Quality & Governance:
8. Data Quality and Governance: AI-ready Data
9. Data Transparency, Explainability, and Fairness: Building Trust in AI
Moving Forward
Let’s demystify data and lay the groundwork for your AI/ML success.
Storage and Infrastructure
1. Evolving Data Management and Emerging “DataCo”
Data management is evolving rapidly, driven by tech advancements and growing pressure to harness value from data in the age of AI.
The role of the data analyst has evolved into that of the data scientist, expanding beyond statistical analysis to include the development of AI/ML models that predict trends and generate insights. Data Engineers now manage big data technologies, cloud platforms, and real-time data processing while maintaining systems for data capture, storage, and reporting. Data Stewards, the custodians of data governance who ensure data quality, compliance, and proper use within an organization, now also address data privacy and security issues and must stay conversant with regulations. Organizations are also adopting newer approaches to maximize the value of data as an asset, e.g., enriching internal datasets with external data and formalizing data as products.
Organizations are also getting creative with their data strategies. One notable example is the emerging concept of the "DataCo," an independent entity that manages and leverages data separately from the core operations of the parent company. Popularized by the EDM Council, a well-regarded global trade association that connects data professionals, regulators, and industry leaders, the DataCo model is intended to drive an organization's most advanced approaches to managing data, regardless of its current data maturity, while incorporating business, financial, legal, and risk management considerations more effectively.
Even as roles and tech evolve, the key concepts of data management remain relevant when discussing data:
Metadata: data about the data (e.g., field names, formats, definitions); makes data easier to locate, understand, and integrate across systems.
Golden Source: a single, authoritative source of truth, ensuring consistency across systems.
Critical Data Element (CDE), also called Key Data Element (KDE): data critical for operations, compliance, and decision-making, often prioritized in data strategy.
Data Lineage: the life story of data - where it comes from, how it's transformed, and where it ends up; vital for ensuring quality, compliance, and accountability. (A brief illustration of these concepts follows this list.)
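To make these concepts concrete, here is a minimal sketch of how a metadata and lineage record for one critical data element might be captured; the field names and values are purely illustrative, not a standard.

```python
# Illustrative only: a simple metadata/lineage record for one critical data element.
customer_email_metadata = {
    "field_name": "customer_email",        # metadata: what the data is called
    "format": "string (email address)",    # metadata: how it is stored
    "definition": "Primary contact email collected at onboarding",
    "golden_source": "CRM system",         # the single authoritative source of truth
    "is_critical_data_element": True,      # flagged as a CDE/KDE
    "lineage": [                           # where the data came from and how it moved
        "CRM (captured at onboarding)",
        "Nightly pipeline (validated and standardized)",
        "Data warehouse (reporting and analytics)",
    ],
}

print(customer_email_metadata["golden_source"])
```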
Although technical, understanding the basics of the organization's data architecture - the systems that manage databases, storage, and data flows - particularly where "technical debt" or underinvestment exists, is critical to making informed, timely decisions.
2. Understanding Computing Infrastructure: Cloud, Hybrid, and Edge
Cloud technology emerged in the early 2000s and revolutionized computing by enabling companies to access computing power, storage, and networking services via the internet. This shift significantly lowered barriers for smaller businesses to engage in digital commerce.
By the 2010s, advances in internet speed and cloud platforms led to the rise of Software as a Service (SaaS), Infrastructure as a Service (IaaS), and Platform as a Service (PaaS). However, the need to maintain certain critical operations and sensitive data on-premises gave rise to the popular hybrid cloud model we see today.
Hybrid clouds have continued to evolve, now often incorporating edge computing, where data is processed locally on devices, with only essential insights sent to the cloud for broader analysis and storage. This approach effectively enables AI/ML models to be trained in the cloud while deployed on-premises or at the edge to minimize latency (delay in data transmission) and protect privacy.
We also see companies increasingly adopting multi-cloud hybrid strategies to avoid vendor lock-in and strengthen data protection through redundancy, paired with unified security frameworks such as Zero-Trust Architecture and Cloud-Native Security Solutions that protect both cloud and on-prem environments.
A key hybrid enabler is containerization, which packages applications, libraries, system tools, code, etc., into a lightweight, standalone executable unit - a container - so that applications run consistently across hybrid environments.
3. Data Storage: Warehouse, Lake, Open Table Formats (OTF), and Lakehouse
Storage determines how quickly data can be accessed, processed, and utilized, which directly impacts the scalability of AI projects. Storage technologies have evolved in response to the growing complexity of data - often referred to as the "new oil" waiting to be "refined" from raw material in transactional databases, the internet, and mobile phones into valuable insights. We will discuss the fundamentals, then touch on newer trends.
Data Warehouses: Stability and Simplicity for Structured Data
Data warehousing technology was developed in the 1990s to provide integrated and governed data that produces trusted, reliable insights. Warehouses aggregate data from siloed transactional systems, e.g., CRM and ERP, into a single repository. This creates a single version of truth, enabling reporting and analytics through structured query language (SQL). These relational databases excel at handling structured (tabular) data and remain vital for data-driven decision-making even today due to their stability, simplicity, and low cost.
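As a minimal sketch of what this looks like in practice, the snippet below uses Python's built-in sqlite3 module as a stand-in for a warehouse: it consolidates a few hypothetical sales records into one table and answers a typical reporting question with SQL. The table and column names are illustrative only.

```python
import sqlite3

# Stand-in for a warehouse: an in-memory SQL database (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")

# Imagine these rows arriving from siloed CRM/ERP systems after consolidation.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "Policy A", 1200.0), ("East", "Policy B", 800.0), ("West", "Policy A", 950.0)],
)

# SQL answers a typical reporting question against the single version of truth.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
):
    print(region, total)
```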
Data marts enable smaller teams or non-technical functions to perform model training and analytics more quickly, with fewer resources and without the complexity of a full data warehouse. These small-scale repositories can be built independently of, or as an extension of, a data warehouse.
Data Lakes: Flexibility for Unstructured Data
By the 2010s, Data Lakes had emerged in response to the explosion of unstructured data - data that doesn't follow a predefined format, such as text, images, videos, or sensor data - which demands flexibility in handling diverse data types and processing/analytics frameworks while containing rising storage costs. The most common technologies used in data lakes include:
▪ Hadoop Distributed File System (HDFS) - Cost-effective storage layer for sequential reads/writes typical for batch analytics.
▪ NoSQL databases - Optimized for random reads/writes and low-latency operations essential for real-time applications in an operational layer.
▪ Apache Spark - Open-source unified engine for processing raw data into insights (see the sketch after this list).
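As a rough sketch of how a lake workload differs from warehouse-style SQL reporting, the PySpark snippet below reads raw, semi-structured JSON events and aggregates them. The file path and field names are hypothetical, and running it assumes a Spark installation is available.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a local Spark session; in production this would run on a cluster.
spark = SparkSession.builder.appName("lake-sketch").getOrCreate()

# Hypothetical path to raw, semi-structured sensor events landed in the lake.
events = spark.read.json("s3://my-data-lake/raw/sensor-events/")

# Turn raw events into a simple insight: average reading per device per day.
daily = (
    events
    .withColumn("day", F.to_date("event_timestamp"))
    .groupBy("device_id", "day")
    .agg(F.avg("reading").alias("avg_reading"))
)

daily.show()
```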
While lakes bring flexibility and scalability to data-intensive AI/ML processes, they require users to have specialized skills. And, without strong data governance and proper access control, data lakes can quickly degrade into “data swamps,” where raw data becomes difficult to trust, and sensitive data can be compromised.
In response, the data industry developed open table formats (OTFs) to bring key warehouse features, such as transactional consistency and schema management, to lakes. Delta Lake, created by Databricks, and Apache Iceberg™, championed by Snowflake, are the pivotal players in the evolving "table format wars."
Lakehouses: Unifying Data Storage
To combine the scalability of lakes with the performance of warehouses, the lakehouse architecture has emerged to support both structured and unstructured data types in one location for analytics and AI/ML. Leading data platforms adopting lakehouse architecture include Databricks' Lakehouse and Google's BigLake.
To support its large enterprise clients more effectively, Teradata's VantageCloud Lake encompasses all three design patterns - warehouse, lake, and lakehouse - in a single solution.
The right storage architecture depends on specific business objectives and requires balancing scalability, budget, business size, future growth, available resources, security, and ease of use.
Migrating legacy data to new repositories challenges data engineers – they need to balance trade-offs in file formats to minimize disruption and maintain data integrity while ensuring compatibility with existing systems.
Data Pipelines and AI Readiness
4. How Data Flows: From Ingestion, Pipeline, Integration to Analytics
Making data useful starts with data ingestion - the process of handling continuous dataflows, either in batch or real time, from various sources regardless of storage architecture. Two common ingestion processes are:
▪ Extract Load Transform (ELT): Extracts data from source systems and loads it directly into the target platform (e.g., a lake or warehouse) before any transformations, making it more efficient for handling large volumes of structured and unstructured data in modern big data systems.
▪ Extract Transform Load (ETL): Performs transformations (e.g., cleaning, aggregating) before loading data into storage. It's the well-established method for structured data where compliance and cleansing are priorities. (A simple sketch contrasting the two follows.)
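The difference is mostly a question of where the transformation step sits. The sketch below contrasts the two orderings with plain Python functions; the function bodies are illustrative placeholders, not a real pipeline.

```python
# Illustrative placeholders for the three steps.
def extract():
    return [{"name": " Alice ", "premium": "1200"}, {"name": "Bob", "premium": None}]

def transform(rows):
    # Clean and standardize: trim names, fill missing premiums with 0.
    return [
        {"name": r["name"].strip(), "premium": float(r["premium"] or 0)}
        for r in rows
    ]

def load(rows, target):
    print(f"Loading {len(rows)} rows into {target}")

# ETL: transform first, then load curated data into the warehouse.
load(transform(extract()), target="warehouse")

# ELT: load raw data first, then transform inside the target platform.
raw = extract()
load(raw, target="lake/warehouse staging area")
cleaned = transform(raw)   # in practice this step runs inside the platform, e.g., with SQL
```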
Data pipelines automate the flow of data from its source (e.g., applications, databases, devices) to target systems. These pipelines often perform cleaning (e.g., handling missing values, standardizing formats) and transformation (e.g., fitting data to warehouse schemas) along the way.
[Diagram: a good data pipeline (credit: Equal Experts)]
Data pipelines often involve multiple data layers, each serving a specific function at a different stage. Typical layers include:
▪ Collection Layer: Captures raw data from various sources.
▪ Ingestion Layer: Imports and processes raw data for storage.
▪ Storage Layer: Houses the processed data.
▪ Processing Layer: Prepares data for tasks such as analysis or model training.
▪ Integration Layer: Unifies data from diverse systems.
▪ Access Layer: Provides access for end-users and applications.
▪ Visualization Layer: Supports analytics tools like Tableau and Power BI.
Modern architectures increasingly include an orchestration layer that coordinates complex data workflows across pipelines.
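To make the layering less abstract, here is a minimal sketch of a pipeline expressed as plain Python functions, one per layer, chained together by a simple orchestration step. The layer boundaries mirror the list above; everything else is illustrative.

```python
# Each function stands in for one layer of the pipeline (illustrative only).
import json

def collect():
    # Collection layer: capture raw records from a source system.
    return ['{"customer": "Ann", "amount": "120"}', '{"customer": "Bo", "amount": ""}']

def ingest(raw_records):
    # Ingestion layer: parse raw records into a workable structure.
    return [json.loads(r) for r in raw_records]

def store(records):
    # Storage layer: stand-in for writing to a warehouse or lake.
    return list(records)

def process(stored):
    # Processing layer: clean (drop missing amounts) and transform (convert types).
    return [
        {"customer": row["customer"], "amount": float(row["amount"])}
        for row in stored if row["amount"]
    ]

def serve(processed):
    # Access/visualization layers would query this result through BI tools.
    return {"total": sum(row["amount"] for row in processed)}

# Orchestration: coordinate the layers in order (real pipelines use tools like Airflow).
result = serve(process(store(ingest(collect()))))
print(result)
```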
Data Integration combines internal and often external sources into a comprehensive dataset to support a business objective, such as underwriting decisions. Clarifying user and application access - including permissions and roles - is critical for effective data sharing within and outside the organization. Integration challenges often stem from legacy infrastructure and siloed systems.
Interoperable standards, whether open or vendor-specific, are critical to ensure systems, applications and databases can “talk” to each other and apply to both data integration and data pipelines.
Popular data ingestion tools range from basic hand-coding to comprehensive platforms like Apache Kafka and Informatica that can stream data and orchestrate workflows automatically, minimizing manual intervention in complex data ecosystems.
In the context of data flows, APIs (Application Programming Interfaces) – protocols that govern interactions between systems – enforce a unified view of data across an organization. They can connect legacy systems with modern applications without requiring system overhaul and facilitate integration with external sources and between organizations.
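As a small, hedged illustration of what calling such an API looks like, the snippet below uses the widely used Python requests library against a hypothetical endpoint; the URL, token, and response fields are made up for illustration.

```python
import requests

# Hypothetical endpoint exposing customer data from a legacy system through an API layer.
url = "https://api.example-insurer.com/v1/customers/12345"

# Authentication/authorization is typically enforced with tokens and scoped roles.
headers = {"Authorization": "Bearer <access-token>"}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()      # fail loudly if access is denied or the call errors

customer = response.json()       # the API returns data in a consistent, documented format
print(customer.get("name"), customer.get("policy_count"))
```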
Many advanced data platforms now combine data integration, analytics, and visualization capabilities. Examples include Palantir, known for handling sensitive government data, while competitors such as IBM, Databricks, and Teradata each have their own strengths. These platforms often allow integration with analytics and business intelligence (BI) players that have integrated AI capabilities, such as Tableau, Alteryx, and SAP.
5. Utilizing Data: Cataloging, Data Mesh, and Data Security
Effective management of data assets begins with data cataloging, a foundational solution often integrated into larger data platforms. Cataloging involves organizing and indexing all data in an organization. It starts with data discovery, typically using automated tools to scan data sources and document available data. As data is uncovered, metadata – data about the structure, format, and usage of the data - is captured, categorized, and tagged with keywords and descriptions. These metadata and tags are then compiled into a searchable catalog, effectively acting as a “search engine” for the organization’s data assets.
Data cataloging is essential for enabling modern architectures like data mesh - an evolving approach that decentralizes data ownership and management by assigning responsibility to data domain owners while maintaining federated governance. In a data mesh, data is treated as a product, with domain owners managing quality, security, and compliance end-to-end. Owners are given self-serve tools and operate independently of a central IT team. Early adopters of data mesh include large, data-heavy organizations such as JPMorgan Chase and Goldman Sachs in financial services and Airbnb and Uber in technology.
Data security is ideally embedded throughout the data pipeline, starting at the ingestion point, where encryption may be applied to safeguard sensitive data. Security extends to protecting information stored in databases and during the integration of tools and applications. Defining roles and permissions is critical and is enforced through authentication and authorization practices, particularly during API implementation. Additionally, the data lineage created during cataloging provides an audit trail, tracing data usage and ensuring adherence to policies.
6. Specific to AI/ML: Features and Enterprise Feature Stores (EFS)
AI technologies rely on training data to teach models useful patterns that generate outputs for real-world applications. The raw training data is organized into features - engineered or transformed data in a form readable by AI models - to produce useful outputs.
While raw data is used to answer the question, "What happened?" typically for reporting or dashboards, features answer, "What patterns in the data are relevant for prediction?” for both model training and inference, bridging domain knowledge and the predictive capabilities of AI.
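As a brief sketch of the step from raw records to features, the pandas snippet below turns hypothetical claim transactions into per-customer features a model could consume; the column names and thresholds are illustrative only.

```python
import pandas as pd

# Raw transactional records: "what happened" (illustrative data).
claims = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "claim_amount": [500.0, 1200.0, 300.0, 0.0, 750.0],
    "claim_date": pd.to_datetime(
        ["2024-01-05", "2024-06-20", "2024-02-11", "2024-03-02", "2024-09-18"]
    ),
})

# Engineered features: "what patterns are relevant for prediction?"
features = claims.groupby("customer_id").agg(
    claim_count=("claim_amount", "size"),
    avg_claim_amount=("claim_amount", "mean"),
    last_claim_date=("claim_date", "max"),
)
features["days_since_last_claim"] = (
    pd.Timestamp("2025-01-01") - features["last_claim_date"]
).dt.days

print(features)
```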
Operationalizing AI/ML at scale - computing and sharing features across teams and stages, from training to production - can be complex, costly, and prone to inconsistency. To address these challenges, enterprise feature store (EFS) technology has emerged to centralize the transformation of raw data into model-ready features and make them available for reuse across ML workflows.
First introduced by Uber in 2017, feature stores have become a core component of modern AI/ML platforms, integrated into offerings from major data and cloud players such as AWS's SageMaker, Google Cloud's Vertex AI, Databricks, and Teradata's Vantage Console.
7. Emerging AI-Enabling Technologies: Data Fabric, Semantic AI, RAG, AutoML, Zero-Copy
Building an AI-ready data environment is no small task. Real-world data is often messy, especially in organizations relying on legacy systems developed long before modern data frameworks. The urgency around AI adoption has brought overdue attention to data management and accelerated advances in automated data solutions, driving innovations in several key technologies shaping this landscape.
Data Fabric: A Next-Generation Architecture
Data fabric is a next-generation design approach, emerging in the 2020s, that creates an intelligent, automated data management layer across a distributed data landscape, linking diverse data repositories - data warehouses, lakes, relational databases - across on-premises, cloud, hybrid, and edge environments in real time. It features a single access point, automated bi-directional processing, and cost optimization by pushing queries to the most efficient processing engine. It often leverages AI/ML to automate tasks such as schema recognition, data integration, and metadata utilization, expediting the journey to AI-ready data.
Integrating legacy systems into data fabric architecture, especially in industries like financial services where mergers and acquisitions are common, often requires updating outdated programming and deploying middleware to bridge technology gaps.
Semantic AI: Understanding Data Beyond Syntax
Semantic AI enhances data usability by creating consistent data vocabularies across diverse data sources and formats. Using AI/ML techniques such as natural language processing (NLP) and knowledge graphs, it adds metadata to structured data stored in siloed systems, focusing on the meaning of information (ontologies) rather than syntax or surface-level patterns. For example, illumex's generative semantic fabric platform replaced Excel-based processes to provide full private-information mapping for an insurance broker, helping it achieve compliance with SOC 2 (a data security compliance framework).
Retrieval-Augmented Generation (RAG)
Technologies like retrieval-augmented generation (RAG) and GraphRAG are increasingly used to ground Gen-AI foundation models, such as ChatGPT, in enterprise-specific private data to address hallucinations (incorrect AI responses) and improve model performance. However, RAG solutions require robust, AI-ready data infrastructure and can be expensive when manual labeling and maintenance of structured data are involved.
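As a simplified sketch of the retrieval half of RAG, the snippet below uses scikit-learn's TF-IDF vectors to find the most relevant internal document for a question and assembles it into a grounded prompt. A production system would use embedding models, a vector database, and an actual LLM call, all omitted here; the documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny stand-in for an enterprise knowledge base (illustrative documents).
documents = [
    "Flood damage is excluded under the standard homeowners policy.",
    "Claims must be filed within 30 days of the incident.",
    "The deductible for auto collision coverage is $500.",
]

question = "How long do I have to file a claim?"

# Retrieve: rank documents by similarity to the question.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
question_vector = vectorizer.transform([question])
best_doc = documents[cosine_similarity(question_vector, doc_vectors).argmax()]

# Augment: ground the generative model in the retrieved enterprise data.
prompt = f"Answer using only this context:\n{best_doc}\n\nQuestion: {question}"
print(prompt)   # this prompt would then be sent to a foundation model
```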
Automated Machine Learning (AutoML)
Organizations lacking in-house AI expertise may consider AutoML solutions such as dotData, a pioneer of feature store automation that applies AI/ML and Gen-AI techniques to create, store, and share features for reuse across AI/ML projects. Users upload data from various sources and define the problem to solve, while the platform handles cleaning and transformation and builds predictive models using ML techniques. While this allows firms to start harnessing the power of AI without an AI team, interpreting insights often requires domain-specific knowledge.
Zero-Copy Techniques: Reducing Latency
To reduce latency, memory usage, and data transfer overhead, Zero-Copy techniques have emerged to avoid unnecessary data copying between storage devices, CPUs, and GPUs. Examples include Nvidia's CUDA framework, Apache Arrow's in-memory data format, and Intel's DAOS.
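As a small illustration of the idea, the snippet below uses Apache Arrow's Python bindings to expose the same in-memory buffer to NumPy without copying it; this assumes the pyarrow and numpy packages are installed.

```python
import numpy as np
import pyarrow as pa

# Data held once in Arrow's columnar in-memory format.
arrow_array = pa.array([1.0, 2.5, 3.75], type=pa.float64())

# Zero-copy view: NumPy reads the same buffer instead of duplicating it.
numpy_view = arrow_array.to_numpy(zero_copy_only=True)

print(numpy_view)                 # [1.   2.5  3.75]
print(numpy_view.flags.owndata)   # False: the view does not own (i.e., copy) the data
```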
Data Quality and Governance
8. Data Quality and Governance for AI-Ready Data
Data in the real world is never perfect. However, striving for quality data is essential for successful AI adoption. AI-ready data should be up-to-date, relevant, accurate, consistent, and in machine-readable formats. It also needs comprehensive coverage and sufficient volume to avoid overfitting, along with the necessary features to produce reliable model outputs.
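In practice, many of these checks can be automated. The pandas snippet below runs a few illustrative AI-readiness checks - missing values, duplicates, and freshness - on a hypothetical dataset; the thresholds are examples, not standards.

```python
import pandas as pd

# Hypothetical training dataset.
df = pd.DataFrame({
    "policy_id": [101, 102, 102, 104],
    "premium": [1200.0, None, 980.0, 1500.0],
    "updated_at": pd.to_datetime(["2025-01-02", "2024-03-15", "2024-03-15", "2025-01-10"]),
})

checks = {
    "missing_premium_pct": df["premium"].isna().mean() * 100,          # completeness/accuracy
    "duplicate_policy_ids": int(df["policy_id"].duplicated().sum()),   # consistency
    "stale_rows_over_180_days": int(
        (pd.Timestamp("2025-02-01") - df["updated_at"]).dt.days.gt(180).sum()
    ),                                                                 # up-to-date
}

for name, value in checks.items():
    print(f"{name}: {value}")
```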
Establishing a Strong Data Governance Framework
To uphold data quality standards, leading organizations often have a comprehensive data governance framework with detailed policies outlining how data should be collected, used, stored, and shared. They define data ownership, roles and responsibilities, and set up robust access controls and incident response procedures. For continuous improvement, they establish quality metrics and KPIs, along with feedback mechanisms to enhance data quality over time. They also equip data officers with tools to assess and monitor quality, and invest in metadata management, data integration, interoperability, and architectural guidance for automation throughout data lifecycles.
Meeting Privacy and Data Security Mandates
Safeguarding sensitive and personal data to meet privacy and data security mandates is non-negotiable. Regulations such as the Health Insurance Portability and Accountability Act (HIPAA), the General Data Protection Regulation (GDPR), and the California Consumer Privacy Act (CCPA), along with AI-specific regulations such as the EU's AI Act, require organizations to protect data at every stage of AI deployment.
These mandates often challenge existing risk management and governance frameworks. In response, organizations increasingly turn to data intelligence platforms such as BigID to gain visibility into their data landscape, automate compliance processes, and strengthen data governance.
By combining robust governance with advanced technology, organizations can better ensure their data is secure, compliant, and AI-ready.
9. Data Transparency, Explainability, and Fairness - Building Trust in AI
In regulated industries like insurance, black-box models - where users can see only inputs and outputs but not the internal logic - are unacceptable. To address this, Explainable AI (XAI) methods such as SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) have been developed to increase model transparency and trust.
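As a small sketch of how such a method is applied, the snippet below trains a simple scikit-learn model and uses the shap library to attribute predictions to individual input features; installing shap and scikit-learn is assumed, and the data is synthetic.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: two features, one of which drives the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# SHAP attributes each prediction to the individual input features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])

print(shap_values)   # per-feature contributions for the first five predictions
```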
Technology providers are increasingly embedding explainability into their platforms. For example, Mind Foundry integrates XAI techniques to make high-stakes AI applications more interpretable and accessible, particularly for users in regulated industries such as defense and insurance.
Bias in AI remains a critical concern due to inherent biases in training data and algorithm design. With AI increasingly influencing decisions such as loan approvals and hiring, platforms have emerged to improve model fairness. For example, Citrusˣ proactively tackles this issue by analyzing feature correlations within AI models to uncover hidden biases.
Moving Forward
Data is Complex but Solvable:
This primer simplifies the complexities of building an AI-ready data foundation – an essential step for successfully adopting AI technologies.
For Non-tech Managers:
Master key concepts and vocabulary to gain the context you need to navigate data infrastructure and ecosystems and appreciate emerging technologies. Collaborate with IT and data teams to shape a business-driven strategy, create a roadmap to mature your organization, and boldly move forward with purpose-driven AI initiatives to stay relevant and competitive.
Begin Today:
Data doesn’t need to be perfect - in quality, quantity, or internal capabilities - to deliver results. By adopting frameworks like data mesh or hybrid computing and partnering with solution providers offering modern, agile capabilities, your organization can begin scaling today - one project and one process at a time. For example, Kenja (data integration and LLM- and RAG-powered applications) and indemn.ai (AI/human-in-the-loop workflows for carrier-agent communications) allow enterprises to leverage AI technologies today without overhauling legacy infrastructure.
Be Adaptable:
In this early era of AI adoption, the technology landscape is still evolving and adaptability is key. Embrace technological advances through the right partners and solutions. Align AI and data strategies with your broader business goals to accelerate adoption and achieve sustainable success.
Transparency, Fairness, and Explainability:
Instilling these principles into AI workflows not only reduces risks but also builds trust with stakeholders, regulators, and customers – creating a significant competitive advantage in today’s environments.
The Big Picture:
AI and data aren’t just about technology; they’re about creating an ecosystem and culture where tech-enabled humans collaborate to integrate data and AI in ways that make a difference.
Questions? Need help moving forward? Get in touch.
* Ichun Lai founded Propel Global Advisory LLC, focusing on accelerating the thoughtful and responsible adoption of AI technology in financial services.