Replicating a Data Platform Across 4 Manufacturing Domains
How we built a single data platform architecture and deployed it across four distinct manufacturing domains for a German conglomerate. The ParserFactory pattern, parameterized ADF pipelines, and Terraform IaC made it possible to maintain consistency while respecting domain-specific needs.
Manufacturing data is messy in ways that other industries rarely encounter. A single factory might produce time-series telemetry from PLCs at millisecond intervals, quality measurements from laboratory instruments in proprietary file formats, production orders from SAP in OData format, and scanned documents from decades-old processes that someone uploaded to SharePoint. Now multiply that by four entirely different business domains, each with its own data formats, business logic, and stakeholder expectations.
This was the challenge we faced with a German manufacturing conglomerate operating across four distinct divisions. Each domain — Grinding & Dressing (GD), Supply Chain & Procurement (SCP), Sorting & Inspection (SXT), and a processing division — needed its own data platform. But building four completely independent platforms would be wasteful and unmaintainable. Building one monolithic platform would collapse under the weight of domain-specific requirements. We needed a third path.
The Constraint: Same Architecture, Different Everything Else
The conglomerate’s IT leadership had a clear mandate: architectural consistency across all four domains. Every domain should use the same pipeline patterns, the same medallion architecture, the same deployment tooling. But the data sources, transformation logic, and output requirements were radically different across domains.
GD dealt heavily with Siemens Historian time-series data from grinding machines — high-frequency sensor readings measuring spindle speed, vibration, temperature, and coolant flow. SCP focused on procurement data from SAP, supplier performance metrics, and material master records. SXT processed XRD (X-ray diffraction) crystallography files, a niche scientific data format used for material quality analysis. The processing division had its own mix of ERP data and shop-floor systems.
The question was how to build a platform architecture that was standardized enough to maintain as a single pattern, yet flexible enough to handle XRD files and SAP OData feeds with equal competence.
The Foundation: Parameterized ADF Pipelines
Azure Data Factory became our orchestration backbone, but we used it in a deliberately constrained way. Rather than building custom pipelines for each data source, we designed a set of parameterized pipeline templates that could handle any source type through configuration.
The core ingestion pipeline template accepts parameters for source type (SFTP, REST API, SharePoint, database, blob storage), connection details (stored in Azure Key Vault and referenced by name), file format and parsing configuration, target landing zone path, and scheduling configuration.
This means the pipeline that ingests XRD crystallography files from a network share and the pipeline that pulls change-data-capture feeds from SAP OData use the exact same ADF pipeline definition. The difference is entirely in the parameters passed at runtime, which are stored in a JSON configuration file per data source.
We organized these configurations in a repository structure where each domain has its own directory containing source-specific JSON configuration files. Adding a new data source to any domain means creating a new JSON file, not modifying pipeline code.
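A per-source configuration file in this layout might look like the following. This is an illustrative sketch only; the field names and values are assumptions, not the production schema.

```json
{
  "source_type": "sftp",
  "connection_secret_name": "gd-historian-sftp",
  "file_format": "csv",
  "parser": "siemens_historian",
  "landing_path": "bronze/gd/historian/",
  "schedule": "0 */15 * * *"
}
```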
The one area where we did create domain-specific ADF components was in the linked services and integration runtimes. Each domain has its own set of connections, and some domains require self-hosted integration runtimes to access on-premises data sources behind corporate firewalls. These are provisioned through Terraform, which we will discuss later.
The ParserFactory Pattern
The most architecturally significant decision we made was the ParserFactory pattern for handling the wildly different data formats across domains. This is a classic factory pattern adapted for PySpark, and it turned out to be the single most reusable component in the entire platform.
The pattern works as follows. There is a base class called BaseParser that defines the interface every parser must implement. It has three methods: validate (checks that the source file or payload conforms to expected structure), parse (transforms the raw input into a standardized PySpark DataFrame), and get_schema (returns the expected output schema for downstream consumers).
Then there is a ParserFactory class that acts as a registry. You register parser implementations against a source type identifier, and the factory returns the correct parser when given that identifier. The factory is configured through — you guessed it — a YAML configuration file that maps source types to parser class paths.
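A minimal sketch of the pattern might look like this. The class and method names (BaseParser, validate, parse, get_schema, ParserFactory) follow the description above, but the in-memory registry is illustrative; in production the registry is populated from the YAML mapping of source types to parser class paths, and parse returns a PySpark DataFrame rather than plain Python objects.

```python
from abc import ABC, abstractmethod


class BaseParser(ABC):
    """Interface every source-format parser must implement."""

    @abstractmethod
    def validate(self, raw):
        """Check that the source file or payload conforms to the expected structure."""

    @abstractmethod
    def parse(self, raw):
        """Transform the raw input into a standardized DataFrame."""

    @abstractmethod
    def get_schema(self):
        """Return the expected output schema for downstream consumers."""


class ParserFactory:
    """Registry mapping source-type identifiers to parser classes."""

    _registry = {}

    @classmethod
    def register(cls, source_type, parser_cls):
        cls._registry[source_type] = parser_cls

    @classmethod
    def get_parser(cls, source_type):
        try:
            return cls._registry[source_type]()
        except KeyError:
            raise ValueError(f"No parser registered for source type {source_type!r}")
```

Here registration is manual; loading the YAML configuration and resolving each class path (for example via importlib) would replace the explicit register calls in a real deployment.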
The domain-specific parsers inherit from BaseParser and implement the three methods for their particular data format. Here are some of the more interesting ones we built:
SiemensHistorianParser: Handles time-series data exported from Siemens WinCC Historian. The raw data arrives as CSV files with a non-standard header format — the first few lines contain metadata about the tag configuration, followed by the actual time-series data with timestamps in a locale-specific format. The parser strips the metadata header, normalizes timestamps to UTC, pivots tag values from a wide format to a long format (one row per tag per timestamp), and applies unit conversion based on a tag configuration table.
XRDFileParser: Parses X-ray diffraction data files, which come in several flavors (Bruker RAW, PANalytical XRDML, and generic XY formats). Each format stores diffraction angle (2-theta) and intensity values differently. The parser detects the format from file headers, extracts the measurement data, and produces a standardized DataFrame with columns for sample ID, 2-theta angle, intensity, and measurement metadata. This parser also extracts peak positions and crystallite size estimates, which the materials science team uses for quality control.
SAPODataParser: Handles SAP OData responses for Change Data Capture. SAP OData feeds return data in a paginated JSON format with metadata about each entity’s change type (create, update, delete). The parser handles pagination, extracts the change type, flattens nested structures (SAP loves deep nesting for address blocks and custom fields), and produces a DataFrame with the flattened entity data plus CDC metadata columns.
SharePointDocumentParser: Ingests documents from SharePoint libraries using the Microsoft Graph API. This parser downloads documents, extracts text content (supporting PDF, Word, and Excel formats via Apache Tika), and produces a DataFrame with document metadata (title, author, modified date, library path) and extracted text content. This was primarily used by the SCP domain for supplier contract analysis.
The beauty of this pattern is extensibility. When the processing division needed to ingest data from a new shop-floor system that exported in a custom XML format, we wrote a new parser class, registered it in the factory configuration, and added the source to the domain’s ingestion config. No changes to any existing code. No risk to other domains.
Shared Transformations, Domain-Specific Logic
The medallion architecture (bronze, silver, gold) is consistent across all four domains, but the transformations at each layer are domain-specific. We handled this with a layered approach.
Bronze layer transformations are generic: schema enforcement, data type casting, deduplication based on a configurable key, and append-mode writes to Delta tables. These are identical across domains.
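The configurable-key deduplication step can be sketched in plain Python as follows. The real implementation runs as a keyed PySpark deduplication over Delta tables; the function and field names here are illustrative.

```python
def deduplicate(rows, key_columns):
    """Keep the first occurrence of each composite key, mirroring a keyed dropDuplicates."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```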
Silver layer transformations are where domain-specific logic lives. Each domain has its own set of transformation notebooks organized by entity. However, the notebook structure follows a common template: read from bronze, apply business rules, handle slowly changing dimensions where applicable, write to silver. The business rules are the variable part. GD’s silver transformations include calculations like material removal rate (derived from multiple sensor readings) and tool wear estimation. SXT’s silver transformations include peak matching algorithms for XRD data.
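The common template amounts to a fixed read/transform/write skeleton where only the business-rules step varies by domain. This is an illustrative sketch of that contract, not the actual notebook code; the function names are assumptions.

```python
def run_silver_transform(read_bronze, apply_rules, write_silver):
    """Standardized silver-layer contract: only apply_rules varies by domain."""
    df = read_bronze()        # read from the bronze Delta table
    df = apply_rules(df)      # domain-specific business logic (the variable part)
    write_silver(df)          # merge/append into the silver Delta table
    return df
```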
Gold layer transformations produce domain-specific analytical models. GD has a machine health scoring model. SCP has supplier performance scorecards. SXT has material quality certification tables. These are the most divergent across domains and are owned entirely by the respective domain teams.
Infrastructure as Code with Terraform
Deploying four instances of the same architecture manually would be a nightmare. Terraform made it manageable.
We structured the Terraform code as a set of reusable modules, with each module representing a platform component: the Databricks workspace, the ADF instance with its linked services, the Azure Data Lake Storage hierarchy, the Key Vault with its access policies, and the networking components (VNet, subnets, private endpoints).
The root module composes these into a complete platform deployment, parameterized by domain. Each domain has its own Terraform variable file (a tfvars file) that specifies the domain name, resource naming prefix, SKU choices, data source connection details, and any domain-specific feature flags.
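A domain variable file in this scheme might look like the following. The variable names are illustrative assumptions, not the production module's interface.

```hcl
# gd.tfvars -- illustrative domain variable file
domain_name      = "gd"
resource_prefix  = "mfg-gd"
databricks_sku   = "premium"
enable_shir      = true   # self-hosted integration runtime needed on-premises
enable_dashboard = true   # Streamlit dashboard server for this domain
```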
Deploying a new domain — or redeploying an existing one — is a matter of running Terraform with the appropriate variable file. In practice, we wrapped this in a CI/CD pipeline where infrastructure changes go through the same pull request workflow as application code.
The Terraform state for each domain is stored in a separate Azure Storage container, using the domain name as the container prefix. This isolation prevents one domain’s infrastructure changes from accidentally affecting another, while the shared module source ensures architectural consistency.
One particularly useful pattern was the conditional resource creation based on domain feature flags. Not every domain needs a self-hosted integration runtime. Not every domain needs a Streamlit dashboard server. By using Terraform’s conditional expressions and the count meta-argument, we toggle these optional components per domain without maintaining separate module configurations.
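The feature-flag toggle looks roughly like this. The variable and resource names are illustrative, not the production module, and most resource arguments are omitted.

```hcl
variable "enable_dashboard" {
  type    = bool
  default = false
}

resource "azurerm_container_group" "dashboard" {
  # count = 0 skips creation entirely for domains without a dashboard server
  count = var.enable_dashboard ? 1 : 0
  name  = "${var.resource_prefix}-dashboard"
  # ... remaining arguments omitted for brevity
}
```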
Streamlit Dashboards for Domain-Specific Visualization
Each domain needed its own visualization layer tailored to its specific data and stakeholders. We chose Streamlit for its rapid development cycle and its natural fit with the Python-based data processing stack.
The dashboards connect to the gold layer Delta tables through Databricks SQL endpoints. Each domain has a dedicated SQL warehouse provisioned through Terraform, sized according to its query patterns. GD’s dashboard focuses on machine telemetry visualization — time-series charts of sensor readings with anomaly highlighting. SCP’s dashboard shows supplier scorecards with drill-down capabilities. SXT’s dashboard renders XRD diffraction patterns as interactive plots, allowing materials scientists to compare samples visually.
We deployed the Streamlit applications on Azure Container Instances, one per domain, with Azure Active Directory authentication to ensure that domain stakeholders only access their own dashboards. The deployment is containerized and included in the CI/CD pipeline.
Balancing Standardization and Domain Autonomy
The hardest part of this engagement was not any individual technical challenge. It was finding the right balance between what should be standardized and what should be domain-specific.
We learned to apply a simple heuristic: standardize infrastructure and patterns, customize logic and presentation. The ADF pipeline templates, the ParserFactory framework, the medallion architecture, the Terraform modules, the CI/CD pipelines, the monitoring setup — these are all standardized. The parser implementations, the silver and gold transformations, the dashboard layouts, the alert thresholds — these are domain-specific.
This boundary was not obvious from the start. We initially tried to standardize the silver layer transformations using a configuration-driven approach similar to our data quality (DQ) rule engine. This turned out to be a mistake. Domain-specific business logic is inherently complex and context-dependent. Trying to express material removal rate calculations or XRD peak matching in a generic configuration language produced configurations that were harder to understand than code. We reverted to domain-specific notebooks with a standardized template and interface contract.
Deployment Strategy: One Template, Four Configurations
The deployment pipeline is a single Azure DevOps pipeline definition with domain-specific stages. A commit to the repository triggers the pipeline, which detects which domains are affected by the change (using path-based filtering), runs domain-specific tests, plans the Terraform changes, and deploys them after approval.
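The path-based filtering might be expressed like this in the pipeline definition. The directory names are assumptions based on the per-domain repository layout described earlier, and real per-domain stage selection typically also involves a diff check at runtime.

```yaml
# Illustrative Azure DevOps trigger with path filters
trigger:
  branches:
    include: [main]
  paths:
    include:
      - shared/**     # changes here affect all domains
      - domains/**    # domain-specific changes run only that domain's stages
```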
For application code (parsers, transformations, dashboards), we use a similar pattern. The CI pipeline builds a versioned artifact (a Python wheel for parsers and transformations, a Docker image for dashboards) that is tagged with the domain name. CD deploys the artifact to the appropriate environment.
The critical design decision was that all domains share a single repository. We considered a multi-repo approach (one per domain) but rejected it because it would allow the architectures to diverge silently. In a monorepo, when someone modifies a shared module, they see immediately which domains are affected. Code reviews naturally enforce consistency because reviewers see the cross-domain impact of every change.
Results and Key Takeaways
After rolling out across all four domains over a six-month period, the platform delivers:
- 90% code reuse across domains for infrastructure and pipeline framework components.
- New domain onboarding reduced from months to weeks. When a fifth business unit expressed interest, we had their base platform running in three weeks.
- Consistent monitoring and alerting across all domains, reducing the operational burden on the central data team.
- Domain autonomy for business logic and visualization, enabling each domain to iterate independently.
The key takeaways from this engagement are worth stating explicitly.
First, the ParserFactory pattern is one of the highest-leverage investments you can make in a multi-domain data platform. It cleanly separates the “how to read this format” concern from the “how to process this data” concern, and it makes adding new data sources a matter of writing a single class.
Second, Terraform with domain-specific variable files is the right granularity for infrastructure reuse. Modules provide the shared architecture; variable files provide the domain customization. Anything more complex (like trying to parameterize the Terraform module structure itself) adds complexity without proportional benefit.
Third, resist the urge to over-standardize business logic. Configuration-driven approaches work brilliantly for infrastructure and data quality rules. They work poorly for complex domain-specific transformations where the logic is inherently procedural and context-dependent.
Fourth, a monorepo with path-based CI/CD filtering is the right choice for multi-domain platforms where architectural consistency matters. The overhead of managing a single repository is far less than the cost of domains silently diverging.
Finally, invest in the parser layer early. In manufacturing, the data format problem is much harder than the data volume problem. Getting data from Siemens Historian, XRD instruments, and SAP into a common shape is where most of the engineering effort goes. Once the data is in Delta tables with a consistent schema, everything downstream is comparatively straightforward.
Related Case Study
German Manufacturing Conglomerate: 4 Domain-Specific Data Platforms for Manufacturing → 4 domain platforms from 1 architecture template, 70% faster deployment for new domains