InfiniSynapse Reference Guide

Data Analysis Glossary in 2026: 40 Terms Explained

A 2026 data analysis glossary with 40 terms across SQL, statistics, BI, data engineering, and AI agents — each defined with a worked example and a deeper link.

Author: InfiniSynapse Research, product and data architecture team
Published: 2026-06-28 · Last verified: 2026-06-28 · Next review: 2026-09-28
Evidence base: Vendor documentation (PostgreSQL, MySQL, Snowflake, dbt), statistics references (Kohavi, Wasserstein), and field experience defining these terms across data teams.
Disclosure: Published by InfiniSynapse, an AI data analyst. The glossary is vendor-neutral; InfiniSynapse appears only under the AI-agent category for completeness.
TL;DR
This data analysis glossary covers 40 working terms in five groups: SQL, statistics, BI and reporting, data engineering, and AI agents. Each term gets a one-sentence definition, a worked example, and a link to a deeper guide. It is designed to settle vocabulary in five minutes, not to serve as an encyclopedia.

SQL terms — the eight that come up daily

| Term | Definition | Example |
| --- | --- | --- |
| GROUP BY | Collapse rows into one row per group, with aggregates over the rest | SELECT region, COUNT(*) FROM orders GROUP BY region |
| HAVING | Filter groups after aggregation; WHERE filters rows before | HAVING SUM(amount) > 1000 |
| INNER JOIN | Return only rows matched on both sides of the join key | orders o JOIN customers c ON c.id = o.customer_id |
| LEFT JOIN | Keep all rows from the left table; NULLs where the right table has no match | "customers who have not ordered" → LEFT JOIN with WHERE orders.id IS NULL |
| CTE | Named result set used in a later SELECT, for readable multi-step logic | WITH monthly AS (...) SELECT * FROM monthly |
| Window function | Compute over a window of rows without collapsing them: ranks, running totals | RANK() OVER (PARTITION BY region ORDER BY revenue DESC) |
| UPSERT / MERGE | Insert if new, update if existing, in one statement | MERGE INTO dim_customer USING src ON ... WHEN MATCHED THEN UPDATE |
| NULLIF | Return NULL when the two arguments are equal; commonly used to guard divisions against zero | revenue / NULLIF(customers, 0) |
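
Several of the rows above can be tried end to end on a toy database. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for a real warehouse; the customers and orders tables and their contents are hypothetical illustration data, not part of any real schema.

```python
import sqlite3

# Hypothetical toy data: table names, columns, and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'EMEA'), (3, 'APAC');
    INSERT INTO orders VALUES (10, 1, 800.0), (11, 1, 400.0), (12, 3, 300.0);
""")

# GROUP BY + HAVING: revenue per region, keeping only regions above 1000.
big_regions = conn.execute("""
    SELECT c.region, SUM(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.region
    HAVING SUM(o.amount) > 1000
""").fetchall()

# LEFT JOIN anti-join: customers who have not ordered.
no_orders = conn.execute("""
    SELECT c.id
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.id IS NULL
""").fetchall()

# NULLIF: the divisor becomes NULL when it equals 0, so the result is NULL.
# SQLite already returns NULL on division by zero; PostgreSQL raises an
# error instead, which is where the guard matters most.
safe_ratio = conn.execute("SELECT 100.0 / NULLIF(0, 0)").fetchone()[0]

print(big_regions, no_orders, safe_ratio)
```

Only EMEA clears the HAVING threshold (800 + 400), customer 2 is the one row the anti-join keeps, and the guarded division comes back as None in Python.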

The deeper walkthrough is in SQL for data analysis.

Statistics terms — the eight an analyst needs in a stand-up

| Term | Definition | Example |
| --- | --- | --- |
| p-value | Probability of seeing data at least this extreme if the null hypothesis is true | "p < 0.05" is the conventional but flawed cutoff for significance |
| Confidence interval | Range from a procedure that captures the true parameter at the stated rate over repeated samples | 95% CI for conversion rate: [3.2%, 3.7%] |
| Sample size / power | Number of observations needed to detect an effect of a given size | "We need 50,000 users per arm to detect a 1% lift at 80% power" |
| SRM | Sample Ratio Mismatch: the observed assignment split deviates from the intended one | On an intended 50/50 split, 49.2% vs 50.8% might signal broken randomization |
| NTILE / quantile | Bucket ordered rows into N groups of near-equal size | Income quintiles via NTILE(5) |
| Cohort retention | Share of an entry cohort still active at time N | "% of signup cohort still active 90 days later" |
| Cohort vs segment | Cohort = same entry time; segment = same attribute | "March signups" vs "iOS users" |
| Standard error | Estimated standard deviation of a statistic's sampling distribution | SE of mean = SD / sqrt(n) |
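
The standard error, confidence interval, and SRM rows lend themselves to a short calculation. This is a rough sketch under stated assumptions: the counts (690 conversions from 20,000 users, and a 49,200 vs 50,800 split) are invented for illustration, the interval uses the simple normal approximation, and the function names are hypothetical.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion; z=1.96 gives about 95%."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of a sample proportion
    return p, se, (p - z * se, p + z * se)

def srm_chi_square(count_a, count_b, expected_share_a=0.5):
    """Chi-square statistic (1 degree of freedom) for a two-arm assignment split."""
    total = count_a + count_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    return (count_a - exp_a) ** 2 / exp_a + (count_b - exp_b) ** 2 / exp_b

# Assumed data: 690 conversions out of 20,000 users.
p, se, (low, high) = proportion_ci(successes=690, n=20_000)

# Assumed data: 100,000 users meant to split 50/50, observed 49.2% vs 50.8%.
chi2 = srm_chi_square(49_200, 50_800)

# 3.84 is the 5% critical value for chi-square with 1 degree of freedom;
# SRM checks often use a stricter cutoff such as 10.83 (the 0.1% level).
srm_flag = chi2 > 10.83

print(f"rate={p:.4f} se={se:.5f} ci=[{low:.4f}, {high:.4f}] chi2={chi2:.1f} srm={srm_flag}")
```

With these assumed counts the interval rounds to the [3.2%, 3.7%] shown in the table, and the 49.2% vs 50.8% split is flagged. The same percentages on a much smaller sample would not be, which is why the table says "might".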

BI and reporting terms — the eight that anchor dashboards

| Term | Definition | Example |
| --- | --- | --- |
| Semantic layer | Shared business definitions of metrics across BI consumers | "Active customer = ordered in last 90 days" |
| Dimension / measure | Dimension = how you split; measure = what you sum or count | Dimension: region. Measure: revenue. |
| Drill-down | Move from aggregated view to constituent rows | Click a region's revenue → see orders by city |
| Slowly changing dimension | How dimension attributes change over time: Type 1 overwrite, Type 2 history | Customer plan history tracked Type 2 for cohort analysis |
| Fact / dim table | Fact = events with measures; Dim = descriptive attributes | orders (fact) joined to customers (dim) |
| Star schema | One fact table joined to multiple dimension tables | orders fact + customer / product / time dims |
| YoY / WoW | Year-over-year, week-over-week period comparison | "Revenue up 14% YoY" |
| OLAP cube | Pre-aggregated structure across dimensions | Modern equivalent: materialized aggregate tables |
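
The Type 2 row is easier to picture with data. Below is a minimal sketch, assuming a hypothetical customer plan dimension with valid_from and valid_to columns; the column names and the half-open validity convention are illustrative choices, not a standard this glossary prescribes.

```python
from datetime import date

# Hypothetical Type 2 dimension: one row per version of a customer's plan.
# valid_to=None marks the current version. A Type 1 table would instead hold
# a single row per customer and overwrite "plan" in place, losing history.
dim_customer_plan = [
    {"customer_id": 1, "plan": "free", "valid_from": date(2026, 1, 1), "valid_to": date(2026, 3, 15)},
    {"customer_id": 1, "plan": "pro", "valid_from": date(2026, 3, 15), "valid_to": None},
]

def plan_as_of(rows, customer_id, as_of):
    """Return the plan in effect on as_of (valid_from inclusive, valid_to exclusive)."""
    for row in rows:
        if row["customer_id"] != customer_id:
            continue
        ends = row["valid_to"]
        if row["valid_from"] <= as_of and (ends is None or as_of < ends):
            return row["plan"]
    return None  # no version covers that date

plan_in_february = plan_as_of(dim_customer_plan, 1, date(2026, 2, 1))
plan_in_april = plan_as_of(dim_customer_plan, 1, date(2026, 4, 1))
print(plan_in_february, plan_in_april)
```

Keeping both versions is what lets a cohort analysis ask which plan a customer was on at signup, rather than which plan they are on today.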

Data engineering terms — the eight pipeline regulars

| Term | Definition | Example |
| --- | --- | --- |
| ELT / ETL | Extract-Load-Transform vs Extract-Transform-Load | Modern default for cloud warehouses is ELT |
| dbt | Transformation framework: SQL models, tests, docs | dbt staging / intermediate / mart pattern |
| CDC | Change Data Capture: sync only the rows changed since the last sync | Postgres logical replication feeding the warehouse |
| Reverse-ETL | Push warehouse data into operational tools like CRMs | Hightouch, Census syncing account scores to Salesforce |
| Data contract | Producer-consumer agreement on schema and SLA | "orders.id will not be reused; deletes are soft" |
| Lakehouse | Single platform combining warehouse and lake capabilities | Databricks Delta Lake |
| Schema drift | Source schema changes that can break downstream models | New column added upstream; dbt staging absorbs it |
| Materialized view | Stored query result, refreshed on a schedule or trigger | Snowflake materialized view for repeatedly hit aggregates |
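
Schema drift and data contracts meet in a simple comparison: what the consumer expects versus what the source now sends. This is an illustrative sketch only; the contract, the column names, and the detect_drift helper are all hypothetical and do not correspond to any dbt or warehouse API.

```python
# Hypothetical contract: the columns and types a downstream model expects.
EXPECTED_COLUMNS = {"id": "INTEGER", "customer_id": "INTEGER", "amount": "REAL"}

def detect_drift(expected, actual):
    """Compare two {column: type} mappings and report what changed."""
    shared = set(expected) & set(actual)
    return {
        "added": sorted(set(actual) - set(expected)),
        "removed": sorted(set(expected) - set(actual)),
        "retyped": sorted(c for c in shared if expected[c] != actual[c]),
    }

# Assumed upstream state: one new column and one changed type.
actual_columns = {"id": "INTEGER", "customer_id": "TEXT", "amount": "REAL", "coupon_code": "TEXT"}

drift = detect_drift(EXPECTED_COLUMNS, actual_columns)
print(drift)
```

An added column is usually harmless when the staging layer selects columns explicitly; a removed or retyped column is the kind of change a data contract is meant to prevent.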

Deeper coverage in SaaS data platform and data integration platforms.

AI agent terms — the eight that came into common use in 2025–2026

| Term | Definition | Example |
| --- | --- | --- |
| Data agent | AI agent that reads, queries, and reasons over structured data | Plans, runs SQL, verifies, returns evidence |
| NL2SQL | Natural language to SQL translation | "What was revenue last month?" → SELECT ... FROM orders |
| RAG | Retrieval-augmented generation: the model retrieves context before answering | Agent retrieves business glossary before drafting SQL |
| Knowledge base binding | Pair a database with curated business definitions that an agent reads as a tool | The differentiator described in database + KB binding |
| Plan mode | Agent presents a plan for human review before executing | "I will join orders to customers on ... — approve?" |
| Verification step | Independent query that cross-checks the main result | Count rows two ways; surface a gap if they disagree |
| Evidence trail | Plan + code + result + verification + sources, stored together | The audit-grade artifact for regulated workflows |
| Agentic analytics | BI shifted from pre-modeled questions to planner-executor-verifier loops | Detailed in agentic analytics explained |
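
The verification step row ("count rows two ways") can be sketched in a few lines. The example assumes Python's sqlite3 module and a hypothetical pair of tables in which one order points at a missing customer; it illustrates the idea and is not how any particular agent implements it.

```python
import sqlite3

# Hypothetical toy data: order 12 references a customer that does not exist.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Lin');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 2, 70.0), (12, 99, 30.0);
""")

# Main result: order count through the join the plan called for.
joined_count = conn.execute("""
    SELECT COUNT(*) FROM orders o JOIN customers c ON c.id = o.customer_id
""").fetchone()[0]

# Verification: the same count taken directly from the source table.
raw_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

gap = raw_count - joined_count
verified = gap == 0
print(f"joined={joined_count} raw={raw_count} gap={gap} verified={verified}")
```

Here the two counts disagree by one row, so the check fails and the gap is surfaced rather than hidden inside the final number.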

How to use this glossary

  1. Onboard a new hire. Send the link; ask them to flag terms they cannot define in one sentence after reading.
  2. Settle a meeting argument. "When you said cohort, did you mean signup cohort or feature cohort?" — point to the row.
  3. Sanity-check vocabulary before a review. Skim before a stakeholder meeting; calibrate to the audience.
  4. Bookmark the changes. The list updates every 90 days; the AI-agent section moves fastest.
Glossaries do not stop arguments — they make them shorter.

See agentic analytics terms in action

Connect a small Postgres or MySQL database read-only. Seed a business glossary. Ask one question and watch which glossary entries show up in the answer trail — plan mode, RAG retrieval, knowledge base binding, verification step, evidence trail, all visible together.


FAQ

What is included in this data analysis glossary?
Forty terms across five categories: SQL (GROUP BY, HAVING, INNER JOIN, LEFT JOIN, CTE, window function, UPSERT / MERGE, NULLIF), statistics (p-value, confidence interval, sample size and power, SRM, NTILE, cohort retention, cohort vs segment, standard error), BI and reporting (semantic layer, dimension and measure, drill-down, slowly changing dimension, fact and dim tables, star schema, YoY and WoW, OLAP cube), data engineering (ELT and ETL, dbt, CDC, reverse-ETL, data contract, lakehouse, schema drift, materialized view), and AI agents (data agent, NL2SQL, RAG, knowledge base binding, plan mode, verification step, evidence trail, agentic analytics).
Why does this glossary update every 90 days?
Vocabulary moves fast in data work. Terms like "data lakehouse" replaced parts of "data warehouse" between 2020 and 2023, "agentic analytics" entered common use in 2024 and 2025, and "knowledge base binding" emerged in 2025 and 2026. A 90-day refresh cadence keeps the glossary aligned with current vendor and academic usage rather than freezing the vocabulary at one moment in time.
What is the difference between cohort and segment in data analysis?
A cohort groups records by the time they entered — March 2026 signups, users who first activated feature X this quarter, customers acquired through paid social in Q1. A segment groups records by an attribute — iOS users, customers in EMEA, accounts on the annual plan. Cohorts measure how things change over the entry timeline; segments measure how groups differ at a moment.
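
The distinction shows up as nothing more than a different grouping key. A minimal sketch, assuming a handful of made-up user records and a hypothetical active_day_90 flag:

```python
from collections import defaultdict
from datetime import date

# Hypothetical user records; active_day_90 marks activity 90 days after signup.
users = [
    {"id": 1, "signup": date(2026, 1, 2), "platform": "iOS", "active_day_90": True},
    {"id": 2, "signup": date(2026, 1, 20), "platform": "Android", "active_day_90": False},
    {"id": 3, "signup": date(2026, 2, 5), "platform": "iOS", "active_day_90": True},
    {"id": 4, "signup": date(2026, 2, 9), "platform": "iOS", "active_day_90": False},
]

def retention_by(records, key):
    """Share of users still active at day 90, grouped by the given key function."""
    groups = defaultdict(list)
    for user in records:
        groups[key(user)].append(user["active_day_90"])
    return {group: sum(flags) / len(flags) for group, flags in groups.items()}

# Cohort: grouped by when the user entered (signup month).
by_cohort = retention_by(users, lambda u: u["signup"].strftime("%Y-%m"))

# Segment: grouped by an attribute (platform), regardless of entry time.
by_segment = retention_by(users, lambda u: u["platform"])

print(by_cohort)
print(by_segment)
```

The same four users produce two different stories: both signup cohorts retain half their users, while the platform segments differ sharply.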
What is a window function in SQL?
A window function computes a value over a window of related rows without collapsing them. The classic examples are RANK, ROW_NUMBER, SUM as a running total, and LAG or LEAD to access previous or next rows. Unlike GROUP BY, which collapses rows into one per group, a window function keeps every row and attaches a computed value derived from its window.
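
The contrast with GROUP BY is visible in the row counts. This sketch assumes Python's sqlite3 module built against SQLite 3.25 or later (the first release with window functions) and a hypothetical sales table.

```python
import sqlite3

# Hypothetical toy data: three sales reps across two regions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, rep TEXT, revenue REAL);
    INSERT INTO sales VALUES
        ('EMEA', 'ana', 300), ('EMEA', 'bo', 500), ('APAC', 'cy', 200);
""")

# GROUP BY collapses to one row per region.
grouped = conn.execute("""
    SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region
""").fetchall()

# Window functions keep every row and attach the rank and the region total.
ranked = conn.execute("""
    SELECT region, rep, revenue,
           RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rank_in_region,
           SUM(revenue) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY region, rank_in_region
""").fetchall()

print(len(grouped), "grouped rows;", len(ranked), "windowed rows")
```

Three input rows become two grouped rows but stay three windowed rows, each now carrying its rank within the region and the regional total.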
What is the difference between ELT and ETL?
ETL extracts data, transforms it inside the pipeline tool, and loads the modeled output into the warehouse. ELT extracts and loads raw data into the warehouse first, then transforms inside the warehouse — typically via dbt. The modern default for cloud warehouses like Snowflake, BigQuery, and Redshift is ELT because warehouse compute is now cheap enough to land raw data and model it in place.
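
The ELT ordering can be sketched on a laptop. In this illustration an in-memory SQLite database stands in for the cloud warehouse, the raw rows are invented, and the final CREATE TABLE AS statement plays the role a dbt model would normally own; none of these names come from a real pipeline.

```python
import sqlite3

# Extract: rows as they might arrive from a source system (hypothetical shape,
# everything still text, one amount carrying stray whitespace).
raw_rows = [
    ("10", "2026-05-01", " 19.90"),
    ("11", "2026-05-01", "5.10"),
    ("12", "2026-05-02", "7.00"),
]

warehouse = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# Load: land the data untransformed.
warehouse.execute("CREATE TABLE raw_orders (id TEXT, ordered_on TEXT, amount TEXT)")
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: clean and model inside the warehouse with SQL.
warehouse.execute("""
    CREATE TABLE daily_revenue AS
    SELECT ordered_on,
           ROUND(SUM(CAST(TRIM(amount) AS REAL)), 2) AS revenue
    FROM raw_orders
    GROUP BY ordered_on
""")

daily = warehouse.execute(
    "SELECT ordered_on, revenue FROM daily_revenue ORDER BY ordered_on"
).fetchall()
raw_count = warehouse.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(daily, raw_count)
```

Because the raw table survives the transform, the model can be rebuilt with different logic later without re-extracting from the source. In an ETL flow the cleaning would happen before the load, and only the modeled table would reach the warehouse.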
What does knowledge base binding mean for an AI data agent?
Knowledge base binding pairs a database connection with a curated layer of business definitions that an AI agent retrieves as a tool call before drafting SQL. The bound layer contains entries like "active customer = ordered in last 90 days", "paid status = subscription_status IN (active, trialing)", or "MRR = sum of recurring subscription amounts excluding refunds". The agent uses these definitions to ground its SQL in your team's operational vocabulary.
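
The retrieval half of that flow can be illustrated with a toy lookup. This is a deliberately simplified sketch: the glossary entries are the examples from the answer above, while lookup_definitions and build_prompt are hypothetical helpers, and real systems would typically use a more robust retrieval method than substring matching.

```python
# Glossary entries taken from the examples above; the structure is hypothetical.
BUSINESS_GLOSSARY = {
    "active customer": "ordered in last 90 days",
    "paid status": "subscription_status IN (active, trialing)",
    "mrr": "sum of recurring subscription amounts excluding refunds",
}

def lookup_definitions(question, glossary=BUSINESS_GLOSSARY):
    """Toy retrieval tool: return entries whose term appears in the question."""
    text = question.lower()
    return {term: definition for term, definition in glossary.items() if term in text}

def build_prompt(question):
    """Prepend the retrieved definitions so SQL drafting is grounded in them."""
    found = lookup_definitions(question)
    lines = [f"- {term}: {definition}" for term, definition in found.items()]
    return "Definitions:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"

question = "What is MRR from each active customer this quarter?"
retrieved = lookup_definitions(question)
prompt = build_prompt(question)
print(prompt)
```

Only the two definitions the question touches are passed along, which keeps the drafting context small and makes it visible in the evidence trail which definitions shaped the SQL.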
How is this glossary useful for onboarding a new analyst?
Send the link as part of week-one onboarding; ask the new analyst to flag any term they cannot define in one sentence after reading the table. Their flagged list becomes the agenda for a 30-minute calibration session with their manager or buddy. The glossary is opinionated — terms a working data analyst meets weekly in 2026 — so coverage maps closely to the vocabulary they will encounter in the first month.

Methodology and review notes

Last updated: 2026-06-28 · Next scheduled review: 2026-09-28

This glossary draws on vendor documentation from PostgreSQL, MySQL, Snowflake, BigQuery, dbt Labs, and the major BI vendors; statistics references including Ron Kohavi's experimentation literature and the American Statistical Association's statement on p-values; AI agent research from Anthropic and the ReAct paper; and field experience defining these terms in onboarding and review settings across multiple data teams. The selection is opinionated toward terms used weekly in 2026.

Conflict of interest: InfiniSynapse publishes this guide and sells an enterprise AI data analyst. To reduce bias, the page leads with the topic itself, treats InfiniSynapse as one option among many, and links to external sources for every numeric claim.

Update cadence: Reviewed every 90 days for accuracy and link health.

Sources and references

  1. [Vendor] PostgreSQL. Documentation. postgresql.org/docs.
  2. [Vendor] Snowflake. Documentation. docs.snowflake.com.
  3. [Vendor] dbt Labs. Documentation. docs.getdbt.com.
  4. [Independent] American Statistical Association. Statement on p-values. amstat.org p-value statement.
  5. [Independent] Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. arxiv.org/abs/2210.03629.
  6. [Vendor] Anthropic. Building Effective Agents. anthropic.com/research/building-effective-agents.
  7. [Standard] NIST. AI Risk Management Framework. nist.gov/itl/ai-risk-management-framework.
  8. [Independent] BIRD-SQL benchmark. bird-bench.github.io.
