• 8 hours
  • Easy

Last updated on 3/15/23

Explore the Concept of a Data Pipeline

Discover the Data Pipeline

By now, it is pretty clear what data is, but where does it come from? And where does it go?

In the data world, you'll often hear about data pipelines, which can be compared to a factory. For example, in a cookie factory, raw ingredients (flour, butter, and sugar) go in one end, and the product, cookies, comes out of the other. In the middle is a transformation process (i.e., a set of operations executed in the correct sequence):

A picture shows a data pipeline for the biscuit-making process. The raw ingredients sourced from the farms - butter, flour and sugar - undergo mixing, baking, packing to become the final product - biscuits sent to the final destination - shops.

The factory has taken raw ingredients and added value, turning them into something more refined and useful to the everyday person. Likewise, as you become more data literate, you will be able to transform raw data into something useful, just like the factory workers do with flour, sugar, and butter. So let’s start looking at the magical transformation process for your data! 

Consider the scenario of a newly health-conscious individual. Your friend Zara has bought a fitness tracker that can track her steps and the distance she has covered (whether by walking, running, cycling, etc.). She hopes this is the start of a journey to better all-around health and fitness. She’s even taken up cycling! Since you have expressed your interest in working with data, Zara has asked you to help her on this journey. She wants you to use your data skills to add value to her fitness data.

What data can I extract from the app to follow Zara’s fitness journey? 

Data Sources and Raw Data

You can start by building the first stage of the data pipeline, which shows the data sources and the raw data you’re dealing with.

Let’s say the app records the number of steps and Zara’s resting heart rate.  

A picture shows a smartphone screen with an open fitness tracker app. On the app dashboard we can see the information about Zara’s heart rate, number of steps per day, water consumption, sleep quality, length of training and calories spent.

You can show this on the pipeline like this:

A picture shows the beginning of a data pipeline for Zara’s fitness tracker data. The data source is the fitness tracker app on Zara’s phone. The raw data issued from it are the number of steps and the resting heart rate.

Operations and Information

The next stage in the data pipeline shows the operations you can perform on this data to transform it into useful information. In this case, the app can produce a report that shows Zara’s achievements, such as weekly improvements or beating her previous best day.  

A picture shows the continuation of the previous data pipeline for Zara’s fitness tracker data. The operation applied to both pieces of raw data is creating a fitness report in the phone app. The resulting information is fitness achievements.

Destination

Then you can see the destination (i.e., where this data might go next after processing). Zara might want to share her successes with her friends! 

A picture shows the continuation of the previous data pipeline for Zara’s fitness tracker data. The destination of the information is sharing achievements with Zara’s friends.

Now you have a data pipeline for Zara’s fitness achievements. It’s like an information factory: raw data goes in one end, and refined, processed information comes out on the other.
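To make the three stages concrete, here is a minimal Python sketch of the same pipeline. Everything in it is an assumption made for illustration: the function names, the `DailyReading` structure, and the sample numbers are hypothetical and do not come from a real fitness app.

```python
from dataclasses import dataclass


@dataclass
class DailyReading:
    """One day of raw data from the (hypothetical) tracker app."""
    day: str
    steps: int
    resting_heart_rate: int


def extract_raw_data() -> list[DailyReading]:
    """Data source: stands in for the fitness tracker app. Values are made up."""
    return [
        DailyReading("Mon", 4200, 72),
        DailyReading("Tue", 6100, 71),
        DailyReading("Wed", 5300, 71),
        DailyReading("Thu", 8050, 70),
    ]


def build_report(readings: list[DailyReading]) -> dict:
    """Operation: turn raw readings into information (achievements)."""
    best = max(readings, key=lambda r: r.steps)
    return {
        "best_day": best.day,
        "best_steps": best.steps,
        "total_steps": sum(r.steps for r in readings),
        "lowest_resting_heart_rate": min(r.resting_heart_rate for r in readings),
    }


def share_with_friends(report: dict) -> str:
    """Destination: format a message Zara could send to her friends."""
    return (
        f"New best day: {report['best_steps']} steps on {report['best_day']}! "
        f"Total so far: {report['total_steps']} steps."
    )


# Raw data goes in one end, refined information comes out of the other.
message = share_with_friends(build_report(extract_raw_data()))
print(message)
```

Each function plays the role of one stage in the diagram, so swapping the destination (for example, saving the report instead of sharing it) would only mean replacing the last function.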

Adding More Data to the Pipeline

Can we add more ingredients to our information factory?

Sure! Let’s ask your health-conscious friend to start recording other details about her health, particularly any worrying symptoms she has.

Zara starts to record this data in a spreadsheet that looks like this:

Date | Symptoms
February 01, 2022 |
February 02, 2022 | Headache
February 03, 2022 | Fatigue
February 04, 2022 |

How would you add this new data source to the data pipeline above?

Hopefully, you came up with something like this, which shows the spreadsheet as a new data source and the symptom data coming from it:

A picture shows the continuation of the previous data pipeline for Zara, but with a new data source: health log. The raw data issued from it are Zara's symptoms.
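As a rough illustration, the new data source could be read in Python like this. The sketch assumes Zara exports her spreadsheet as CSV text; the `HEALTH_LOG_CSV` constant and the `extract_symptom_log` function are hypothetical names, and the rows simply mirror her spreadsheet entries.

```python
import csv
import io

# Zara's spreadsheet, exported as CSV (an assumed format).
# The dates contain a comma, so they are quoted.
HEALTH_LOG_CSV = '''Date,Symptoms
"February 01, 2022",
"February 02, 2022",Headache
"February 03, 2022",Fatigue
"February 04, 2022",
'''


def extract_symptom_log(csv_text: str) -> list[dict]:
    """Data source: read the health log and return its raw rows.

    Days with no recorded symptom are kept, with the symptom set to None.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        {"date": row["Date"], "symptom": row["Symptoms"] or None}
        for row in rows
    ]


symptom_log = extract_symptom_log(HEALTH_LOG_CSV)
for entry in symptom_log:
    print(entry["date"], "->", entry["symptom"])
```

At this stage the data is still raw: nothing has been filtered or summarized, which is the job of the operations that come later in the pipeline.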

Of course, Zara will want to do something useful with this new data, which we will explore in the next chapter!

Let’s Recap!

In this chapter, you learned that:

  • A data pipeline is like an information factory. You add value to raw data to transform it into information.

  • The raw data comes from different data sources.

  • Operations transform data into information.

  • You can send the final information to different data destinations. 

In this chapter, you saw the basics of turning data into low-value information. Now that you understand the process, you are ready to take it to the next level. In the next chapter, we will extract the maximum value out of data by tailoring the pipeline to specific objectives. 
