Building Effective Machine Learning Pipelines: Tabular vs. Text Analytics Use Cases
Robust ML pipelines are essential for handling different types of data and tasks efficiently. We explore the 3-pipeline design and how it maps to tabular and text analytics use cases with batch and streaming data refresh.
Machine learning (ML) has become indispensable across many domains, driving the need for robust and scalable ML pipelines that handle different types of data and tasks efficiently. In this article, we'll explore the design of ML pipelines for two common use cases: tabular data, using time series forecasting as an example, and text analytics, using named entity recognition (NER) as an example.
When architecting the ML pipelines, we will use the 3-pipeline design, first coined by the MLOps engineer Pau Labarta Bajo. In this article, both use cases are mapped to the 3-pipeline design, which consists of the feature pipeline, the ML training pipeline, and the inference pipeline; however, their implementations differ based on the characteristics of the data and tasks involved.
3-Pipeline Design for Complete ML Systems
Many data scientists and data science training programs/certifications focus on training the ML model inside notebooks. However, an ML model sitting in a local environment delivers no business value. To create business value, you need to deploy and operationalize the model in production as a scalable, robust, and efficient enterprise ML application. This is where an ML architect (architects the solution), ML engineer (architects and builds the ML pipelines), data engineer (feature pipeline, feature store, and data refresh), and MLOps engineer (ML orchestration and ML operations) can help transform these notebooks from local ML solutions into enterprise ones.
The 3-Pipeline Design: Components
Feature Pipeline: Extracts, transforms, and preprocesses data to generate features for training ML models. Features are stored in a feature store for easy access during training and inference.
ML Training Pipeline: Trains ML models using historical data, evaluates model performance, and registers different versions of models in a model registry for tracking and versioning.
Inference Pipeline: Deploys trained models to make predictions on new data in a scalable and efficient manner, ensuring real-time or batch inference capabilities.
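To make the three components concrete, here is a minimal, illustrative Python sketch in which plain dictionaries stand in for a real feature store and model registry. All function and variable names are hypothetical, and the "model" is just a mean predictor; the point is only to show how the three pipelines hand off work through the shared stores.

```python
# A minimal, illustrative sketch of the 3-pipeline design.
# Plain dicts stand in for a real feature store and model registry,
# and the "model" is just a mean predictor (hypothetical names throughout).

feature_store = {}
model_registry = {}

def feature_pipeline(raw_rows):
    """Extract/transform raw records into features and store them."""
    features = [{"x": r["value"], "y": r["target"]} for r in raw_rows]
    feature_store["training_set"] = features
    return features

def training_pipeline():
    """'Train' a trivial model (mean of targets) and register a new version."""
    rows = feature_store["training_set"]
    mean_y = sum(r["y"] for r in rows) / len(rows)
    version = len(model_registry) + 1
    model_registry[version] = lambda x, m=mean_y: m  # constant model, ignores x
    return version

def inference_pipeline(version, new_x):
    """Load a registered model version and score new data."""
    return model_registry[version](new_x)

raw = [{"value": 1, "target": 10}, {"value": 2, "target": 20}]
feature_pipeline(raw)
v = training_pipeline()
print(inference_pipeline(v, 3))  # prints 15.0, the training-set mean
```

In a real system, each function would be a separately scheduled and deployed job, and the dicts would be replaced by managed services, but the data flow between the three pipelines is the same.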
This 3-pipeline design is contained inside a data science environment/cluster. The integration of the DS cluster with the data sources is via the feature pipeline, and the integration of the DS cluster with the enterprise ML application consuming the predictions/classifications is via the inference pipeline. The 3-pipeline design inside a DS cluster also includes a feature store, ML model registry, and pipeline code repository. I will share the details of what this DS cluster should look like in a future article.
Tabular vs Text Analytics Use Cases for the 3-Pipeline Design
The 3-pipeline design is the architecture needed for a complete ML system, and it can be mapped to ML systems for both tabular and text analytics use cases. To demonstrate the similarities and differences, I will use a time series forecasting example for tabular data and an NER example for text.
Tabular Use Case (Time Series Forecasting)
Feature Pipeline: Extracts features such as historical sales data, transforms and preprocesses them, and stores them in a feature store.
ML Training Pipeline: Trains models using algorithms like ARIMA or deep learning approaches, evaluates performance, and registers models in a model registry.
Inference Pipeline: Deploys models to make predictions on new data, ensuring scalability and efficiency.
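As an illustration of the tabular training and inference steps, the sketch below fits a simple AR(1) autoregression (a stripped-down stand-in for the ARIMA-style models mentioned above) to a toy sales series via least squares, then rolls it forward to forecast. The series, function names, and model choice are all hypothetical.

```python
import numpy as np

def train_ar1(series):
    """Fit y_t = a*y_{t-1} + b by least squares: a stripped-down
    stand-in for the ARIMA-style models a training pipeline would fit."""
    y_prev = np.asarray(series[:-1], dtype=float)
    y_next = np.asarray(series[1:], dtype=float)
    design = np.column_stack([y_prev, np.ones_like(y_prev)])
    (a, b), *_ = np.linalg.lstsq(design, y_next, rcond=None)
    return a, b

def forecast(model, last_value, steps):
    """Roll the fitted AR(1) model forward to produce future predictions."""
    a, b = model
    preds = []
    for _ in range(steps):
        last_value = a * last_value + b
        preds.append(last_value)
    return preds

sales = [100, 110, 121, 133.1, 146.41]  # toy series, ~10% growth per period
model = train_ar1(sales)
print(forecast(model, sales[-1], 2))  # roughly [161.05, 177.16]
```

In the full pipeline, `train_ar1` would live in the ML training pipeline (with the fitted coefficients registered in the model registry) and `forecast` in the inference pipeline.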
Text Analytics Use Case (Named Entity Recognition)
Feature Pipeline: Extracts features like word embeddings and part-of-speech tags from text data, preprocesses, and stores them in a feature store.
ML Training Pipeline: Trains models using labeled text data, validates performance, and registers models in a model registry.
Inference Pipeline: Deploys models to recognize named entities in new text data, ensuring real-time or batch inference capabilities.
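To illustrate the NER inference step, here is a deliberately naive sketch in which a hard-coded gazetteer lookup stands in for a trained statistical or neural NER model. The entity list and function names are hypothetical; a real inference pipeline would load a registered model and apply it the same way.

```python
# A hypothetical gazetteer lookup standing in for a trained NER model.
ENTITY_GAZETTEER = {"london": "LOC", "acme corp": "ORG", "alice": "PER"}

def ner_inference(text):
    """Scan text for known entity phrases and return (span, label) pairs.
    A trained model loaded from the registry would replace this lookup."""
    found = []
    lowered = text.lower()
    for phrase, label in ENTITY_GAZETTEER.items():
        idx = lowered.find(phrase)
        if idx != -1:
            # Recover the original casing from the source text.
            found.append((text[idx:idx + len(phrase)], label))
    return found

print(ner_inference("Alice visited Acme Corp in London."))
```

The interface is the important part: new text goes in, labeled entity spans come out, whether the model behind it is a lookup table, a CRF, or a transformer.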
The 3-pipeline design can also be mapped to a generative AI (LLM) use case, which will be the topic of a future article.
Batch vs Streaming
Data refresh is another important consideration for a complete ML system built on the 3-pipeline design. Batch processing collects and processes data in large volumes at scheduled intervals, while streaming processing handles data in real time as it arrives, allowing for immediate analysis and response. Batch pipelines are much easier to architect and implement.
Batch Data Refresh Example
Tabular Use Case: Refreshes historical sales data, updates feature store, retrains models, evaluates performance, and deploys updated models for inference.
Text Analytics Use Case: Collects new text data, preprocesses, updates feature store, retrains NER models, validates, and deploys for inference.
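The batch-refresh steps above can be sketched as a single scheduled job. In this hypothetical example, "retraining" is just recomputing a mean, evaluation is a mean absolute error, and a simple champion/challenger check decides whether the new model version is deployed; every name and metric here is illustrative, not a prescribed implementation.

```python
import datetime

def batch_refresh(raw_rows, feature_store, model_registry):
    """One scheduled batch-refresh run: rebuild features, retrain,
    evaluate, and deploy only if the new model is at least as good."""
    # 1. Refresh the feature store with the newly collected batch.
    feature_store["latest"] = [float(r) for r in raw_rows]
    data = feature_store["latest"]
    # 2. "Retrain": a naive constant model that predicts the mean.
    new_model = sum(data) / len(data)
    # 3. Evaluate: mean absolute error of that constant prediction.
    new_mae = sum(abs(x - new_model) for x in data) / len(data)
    # 4. Champion/challenger: register the new version only if no worse.
    best = model_registry.get("best")
    if best is None or new_mae <= best["mae"]:
        model_registry["best"] = {
            "model": new_model,
            "mae": new_mae,
            "trained_at": datetime.datetime.now(datetime.timezone.utc),
        }
    return model_registry["best"]

store, registry = {}, {}
champion = batch_refresh([10, 12, 11], store, registry)
print(champion["model"])  # prints 11.0
```

An orchestrator (cron, Airflow, or similar) would invoke a job like this on a fixed schedule; the same shape applies to both the tabular and the NER refresh, only the retraining and evaluation steps differ.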
Real-Time Data Refresh Example
Tabular Use Case: Continuously monitor and preprocess incoming sales data in real time; employ online learning techniques to update the model dynamically, ensuring up-to-date predictions.
Text Analytics Use Case: Capture and preprocess real-time text streams containing named entities; use online learning approaches to continuously refine the NER model and perform immediate inference on incoming text data.
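The online-learning idea behind both real-time examples can be sketched with a tiny linear model updated one record at a time via stochastic gradient descent. The simulated stream, learning rate, and function names are all hypothetical; the point is that the model state is refined per record rather than rebuilt per batch.

```python
def make_online_model(lr=0.1):
    """A tiny online linear model y ~ w*x + b, updated one record at a
    time via SGD: a stand-in for the online-learning step of a
    streaming data refresh (all names here are hypothetical)."""
    state = {"w": 0.0, "b": 0.0}

    def update(x, y):
        # One gradient step on the squared error of a single record.
        err = state["w"] * x + state["b"] - y
        state["w"] -= lr * err * x
        state["b"] -= lr * err

    def predict(x):
        return state["w"] * x + state["b"]

    return update, predict

update, predict = make_online_model()
# Simulate a stream arriving one record at a time; true relation is y = 2x.
for x, y in [(1, 2), (2, 4), (3, 6)] * 100:
    update(x, y)
print(predict(4))  # approaches 8.0 as more records stream in
```

In production, `update` would be wired to the stream consumer (e.g. a Kafka topic) and `predict` would serve the inference pipeline from the same continuously refined state.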
Conclusion
Effective machine learning pipelines are essential for deploying ML solutions at scale. By understanding the 3-pipeline design and tailoring it to specific use cases like tabular data for time series forecasting and text analytics for named entity recognition, organizations can ensure efficient data processing, model training, and inference, ultimately driving better insights and decision-making. Whether it's forecasting sales trends or extracting entities from unstructured text, the key lies in designing robust pipelines that leverage the unique characteristics of the data and tasks involved.