The data pipeline sits at the heart of any business intelligence process. If you’ve ever shopped from a mobile application or website, then you’ve experienced the power of a data pipeline. Companies like Nordstrom and Macy’s use data pipelines to stream data from on-prem and cloud-based databases to provide real-time inventory and pricing data that make your shopping experience possible. They also extract data collected from their mobile app and website and push it along with the data generated by point-of-sale devices in stores to create reports and forecast future sales.
So, what is a data pipeline?
A data pipeline exists anywhere data is moved from one system to another or any time one system’s output becomes another system’s input, including between applications in a microservices architecture. It consists of the tools and processes that move data from a source system and transform it into a format that is usable by a target system.
The data from a single source system may feed multiple target systems that require different formats and models. The data may come from other databases, or it may be extracted from application logs, monitoring solutions, and other production systems.
Anatomy of a data pipeline
A typical data pipeline involves three high-level components:
- Data Extraction – This is where raw data is collected from a variety of sources and ingested into a system for further processing. This can include external sources like websites, social media feeds, or public databases, among many others. It can also include internal sources such as operational systems or other databases that contain relevant information. Data may be extracted from other systems in batches or streamed continuously for real-time analytics.
- Data Transformation – Once raw data has been collected, it often needs to be transformed in order to make it useful for analysis. Transformation includes steps such as cleaning up any invalid or erroneous data points (such as duplicate entries or incorrect formats), changing its format so it’s compatible with other datasets in your system, integrating disparate datasets to provide meaningful data, and extracting specific attributes you want to focus on.
- Data Loading – Once you’ve collected and transformed your data, it needs to be stored somewhere so you can reference it later when needed. Storing the output may require different approaches, from temporary staging storage used during transformation to a long-term destination such as a data warehouse.
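The three stages above can be sketched end to end with nothing but the Python standard library. Everything here is illustrative: the CSV export, the cleaning rules, and the in-memory SQLite target stand in for whatever source format, transformation logic, and destination your pipeline actually uses.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a point-of-sale system (the extraction source).
RAW_CSV = """sku,price,qty
A100,19.99,3
A100,19.99,3
B200,bad,1
C300,5.50,2
"""

def extract(raw: str) -> list[dict]:
    """Extract: read raw rows out of the source format."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: list[dict]) -> list[tuple]:
    """Transform: drop duplicate entries and rows with incorrect formats."""
    seen, clean = set(), []
    for row in rows:
        key = (row["sku"], row["price"], row["qty"])
        if key in seen:
            continue  # duplicate entry
        try:
            clean.append((row["sku"], float(row["price"]), int(row["qty"])))
        except ValueError:
            continue  # invalid data point (e.g. non-numeric price)
        seen.add(key)
    return clean

def load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    """Load: store the cleaned rows in the target system."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (sku TEXT, price REAL, qty INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT SUM(price * qty) FROM sales").fetchone()[0]
```

Of the four raw rows, only two survive the transform step: the duplicate and the row with the malformed price are filtered out before loading, so downstream queries against the target never see them.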
ETL vs. ELT pipelines
For many people, the term “data pipeline” is synonymous with ETL (Extract, Transform, Load), but ETL simply describes one discrete type of data pipeline. With ETL, data is extracted from one or more source systems and put into temporary storage where it is transformed before it’s loaded into the final destination, usually a data warehouse. ETL is useful for pulling data from legacy systems, converting it into the format required by a modern architecture, and consolidating similar data from disparate sources. This approach works best when you don’t need to keep the raw data, since the extraction process must be repeated any time business rules change. ETL is the most common form of data pipeline and has been a standard for many years, but it is only one approach.
ELT (Extract, Load, Transform) is another data pipeline technique that loads the unmodified data into a centralized data lake as-is, and then data cleansing, enrichment, and transformation occur inside the data warehouse itself. Since the raw data permanently resides in the data warehouse alongside the transformed data, it can be processed into multiple formats without having to extract the data again. This is useful when you don’t know exactly how you are going to use the data. ELT is more efficient for larger data sets, and you can test different use cases without transforming the entire data set each time.
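One way to see the difference is that ETL and ELT apply the same transformation at different points. In the sketch below (hypothetical data, with an in-memory SQLite database standing in for the warehouse), the ETL branch loads only the cleaned result, while the ELT branch loads the raw rows as-is and defines the transformation as a view inside the “warehouse,” so new transformations never require re-extracting from the source.

```python
import sqlite3

# Raw extract with one malformed price -- purely illustrative data.
raw = [("A100", "19.99"), ("B200", "bad"), ("C300", "5.50")]

# --- ETL: transform in the pipeline, load only the cleaned result ---
etl = sqlite3.connect(":memory:")
etl.execute("CREATE TABLE prices (sku TEXT, price REAL)")
cleaned = [(s, float(p)) for s, p in raw if p.replace(".", "", 1).isdigit()]
etl.executemany("INSERT INTO prices VALUES (?, ?)", cleaned)

# --- ELT: load the raw data as-is, transform inside the warehouse ---
elt = sqlite3.connect(":memory:")
elt.execute("CREATE TABLE raw_prices (sku TEXT, price TEXT)")
elt.executemany("INSERT INTO raw_prices VALUES (?, ?)", raw)
# The raw rows stay put; a view applies the transformation on demand,
# so changing the business rules means redefining the view, not re-extracting.
elt.execute("""
    CREATE VIEW prices AS
    SELECT sku, CAST(price AS REAL) AS price
    FROM raw_prices
    WHERE price GLOB '[0-9]*.[0-9]*'
""")
```

Both branches answer queries identically, but only the ELT database still holds the untouched raw rows alongside the transformed view.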
Whether you should use ETL or ELT depends on the specific needs of your team.
Data warehouse vs. data lake
Both data warehouses and data lakes are useful in a data pipeline, but each one has distinct features that will determine which is best for your workflow.
- Data Warehouse – A data warehouse stores structured data that has been translated by an ETL process. Data typically does not enter a data warehouse until it is ready for use since the schema for the data in a data warehouse is predetermined.
- Data Lake – A data lake stores both structured and unstructured data in a multitude of formats. It is a useful place to dump all your data, not only the data that is in use but data that you may want to use later and data that you are no longer using but may want to reference later. It stores the unprocessed data that has been extracted using ELT but has not yet been transformed into a usable shape. A data lake stores data in its native format in a flat architecture that can be queried as needed. It’s a cost-effective way to rapidly store lots of data for later processing.
Types of data storage
Data storage is a key component of a data pipeline regardless of which approach you take. The type of storage you choose depends on the type of data you wish to store. Here are some common approaches.
Blob stands for binary large object and consists of data stored in binary format. It is used for storing unstructured data such as images and video or audio for streaming. Blob storage is also effective for storing data that changes frequently, such as log files, and is useful for storing data for backup and disaster recovery processes. Data is stored in containers that are organized similarly to a file system directory structure and is available over HTTP or HTTPS.
Data may also be stored in a database in a structured or unstructured manner. SQL databases, or relational databases, consist of tables of data that are connected to each other using primary and foreign keys, while NoSQL databases allow you to store data in different structures so you don’t need to define the schema ahead of time. With NoSQL, you can store data in column form, key-value stores, standard document formats such as JSON and XML, or in graph form.
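The schema-free model is easiest to see in a key-value document store. The toy store below (an in-memory Python dict holding JSON documents, purely for illustration) happily accepts documents with different fields side by side, whereas a relational table would need a predefined column for every field.

```python
import json

# A minimal key-value "document store": no schema is declared up front,
# so documents with different shapes can live side by side.
store: dict[str, str] = {}

def put(key: str, doc: dict) -> None:
    store[key] = json.dumps(doc)

def get(key: str) -> dict:
    return json.loads(store[key])

# Two customer documents with different fields -- fine in a NoSQL model.
put("cust:1", {"name": "Ada", "email": "ada@example.com"})
put("cust:2", {"name": "Lin", "loyalty_tier": "gold"})
```

A real document database adds indexing, querying, and durability on top, but the core contract is the same: the application, not the storage engine, decides each document’s shape.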
A streaming database is a real-time data repository used for storing data that continuously streams in from multiple sources. As new data flows in, the results of registered queries are immediately updated, so you always have access to the latest data. These registered queries run against data that is always in motion, and they never terminate. If you click on an ad while using a social media app, you may notice that subsequent ads present similar products. It is likely that this app is using a streaming database to respond to the ads you click in order to determine which ones to present in the future.
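A registered query can be thought of as an aggregate that is updated incrementally as each event arrives, instead of being recomputed over the full history on every read. The sketch below simulates this with a running count of ad clicks per category; the event stream and field names are invented for illustration.

```python
from collections import Counter

# A toy "registered query": a running count of clicks per ad category.
# The result set is updated the moment each event streams in, so reads
# always reflect the latest data without rescanning history.
clicks_per_category: Counter = Counter()

def on_event(event: dict) -> None:
    """Handle one incoming click event; the aggregate updates immediately."""
    clicks_per_category[event["category"]] += 1

stream = [
    {"user": "u1", "category": "shoes"},
    {"user": "u2", "category": "shoes"},
    {"user": "u1", "category": "jackets"},
]
for event in stream:
    on_event(event)
```

In a real streaming database the query is expressed declaratively (often in SQL) and the engine maintains the incremental state, but the principle is the same: the query never terminates, and its result is always current.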
Arm your team with the tools to pipeline successfully
Data pipelines are a crucial part of data management, as they automate the movement, transformation, and enrichment of data. If an organization has a strong data pipeline in place, then it can load data into its data warehouse quickly and avoid manual errors – ensuring that its information is reliable and accurate. Since most organizations rely heavily on their databases to provide key insights and real-time information, simplifying the data pipeline increases efficiency and lessens the chance of error. Numerous tools are readily available to streamline your data pipeline. The trick is choosing which tools work best for your organization.
If you’d like to learn more best practices or how to get started with Architect.io, hit up our blog:
- Microservice orchestration or choreography: Which one do you need?
- Why distributed apps need dependency management
- The basics of secret management
Play around with the platform on your own, and let us know what you think. Sign up for a free account and check out our Starter Projects. We promise you won’t regret it. Don’t be afraid to reach out to the team with any questions or comments! You can find us on Twitter @architect_team.