10 Real World Data Science Case Studies Projects with Example

Top 10 Data Science Case Studies Projects with Examples and Solutions in Python to inspire your data science learning in 2023.


Data science has been a trending buzzword in recent times. With wide applications in sectors like healthcare, education, retail, transportation, media, and banking, data science is at the core of pretty much every industry out there. The possibilities are endless: fraud analysis in the finance sector or personalized recommendations for eCommerce businesses. We have put together ten exciting data science case studies to explain how data science is leveraged across various industries to make smarter decisions and develop innovative, personalized products tailored to specific customers.



Table of Contents

  • Data science case studies in retail
  • Data science case study examples in the entertainment industry
  • Data analytics case study examples in the travel industry
  • Case studies for data analytics in social media
  • Real world data science projects in healthcare
  • Data analytics case studies in oil and gas
  • What is a case study in data science?
  • How do you prepare a data science case study?
  • 10 most interesting data science case studies with examples


So, without much ado, let's get started with the data science business case studies!

1) Walmart

With humble beginnings as a simple discount retailer, Walmart today operates 10,500 stores and clubs in 24 countries, along with eCommerce websites, employing around 2.2 million people around the globe. For the fiscal year ended January 31, 2021, Walmart's total revenue was $559 billion, a growth of $35 billion driven by the expansion of its eCommerce business. Walmart is a data-driven company that works on the principle of 'Everyday Low Cost' for its consumers. To achieve this goal, it depends heavily on its data science and analytics division for research and development, also known as Walmart Labs. Walmart is home to the world's largest private cloud, which can manage 2.5 petabytes of data every hour! To analyze this humongous amount of data, Walmart created 'Data Café,' a state-of-the-art analytics hub located within its Bentonville, Arkansas headquarters. The Walmart Labs team heavily invests in building and managing technologies like cloud, data, DevOps, infrastructure, and security.


Walmart is experiencing massive digital growth as the world's largest retailer. Walmart has been leveraging big data and advances in data science to build solutions that enhance, optimize, and customize the shopping experience and serve its customers better. At Walmart Labs, data scientists are focused on creating data-driven solutions that power the efficiency and effectiveness of complex supply chain management processes. Here are some of the applications of data science at Walmart:

i) Personalized Customer Shopping Experience

Walmart analyzes customer preferences and shopping patterns to optimize how merchandise is stocked and displayed in its stores. Big data analysis also helps it understand how new items are selling, decide which products to discontinue, and evaluate how brands are performing.

ii) Order Sourcing and On-Time Delivery Promise

Millions of customers view items on Walmart.com, and Walmart provides each customer with a real-time estimated delivery date for the items purchased. A backend algorithm estimates this date based on the distance between the customer and the fulfillment center, inventory levels, and the shipping methods available. The supply chain management system determines the optimal fulfillment center for every order based on distance and inventory levels. It also has to choose a shipping method that minimizes transportation costs while meeting the promised delivery date.
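This is, at heart, a constrained-optimization decision. As a rough, made-up illustration (not Walmart's actual system), the sketch below picks, for a single order, the cheapest in-stock fulfillment option that still meets the promised delivery date; the center names, costs, and transit times are invented.

```python
from dataclasses import dataclass

@dataclass
class Option:
    center: str        # candidate fulfillment center
    method: str        # shipping method
    cost: float        # transportation cost in USD
    transit_days: int  # estimated days to deliver
    has_stock: bool    # does this center hold all ordered items?

def choose_fulfillment(options, promised_days):
    """Return the cheapest in-stock option that meets the promised date."""
    feasible = [o for o in options if o.has_stock and o.transit_days <= promised_days]
    return min(feasible, key=lambda o: o.cost) if feasible else None

# Hypothetical candidates for one order promised within 3 days.
options = [
    Option("DC-Dallas", "ground", cost=4.10, transit_days=2, has_stock=True),
    Option("DC-Reno", "ground", cost=3.20, transit_days=5, has_stock=True),
    Option("DC-Reno", "air", cost=9.80, transit_days=2, has_stock=True),
    Option("DC-Atlanta", "ground", cost=3.90, transit_days=3, has_stock=False),
]
print(choose_fulfillment(options, promised_days=3))  # picks the Dallas ground option
```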


iii) Packing Optimization 

Box recommendation, or picking the right box for a shipment, is a daily occurrence in retail and eCommerce shipping. Whenever the items of an order, or of multiple orders placed by the same customer, are picked from the shelf and are ready for packing, Walmart's recommender system determines the best-sized box that holds all the ordered items with the least wasted in-box space, and it must do so within a fixed amount of time. This is the Bin Packing Problem, a classic NP-hard problem familiar to data scientists.
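To get a feel for the underlying problem, here is a minimal first-fit decreasing sketch for the one-dimensional version of bin packing; the item volumes and box capacity are invented, and a real packing system would deal with three-dimensional boxes and many more constraints.

```python
def first_fit_decreasing(item_volumes, box_capacity):
    """Pack item volumes into as few boxes as possible using the FFD heuristic."""
    boxes = []  # each box is a list of item volumes
    for item in sorted(item_volumes, reverse=True):
        for box in boxes:
            if sum(box) + item <= box_capacity:
                box.append(item)  # item fits in an existing box
                break
        else:
            boxes.append([item])  # no existing box fits, so open a new one
    return boxes

# Hypothetical order: item volumes in liters, boxes hold 10 liters each.
print(first_fit_decreasing([4.0, 8.0, 1.5, 4.2, 2.1, 6.3], box_capacity=10.0))
```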

Here is a link to a sales prediction data science case study to help you understand the applications of data science in the real world. The Walmart Sales Forecasting Project uses historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and you must build a model to project the sales for each department in each store. This data science case study aims to create a predictive model for the sales of each product. You can also try the hands-on Inventory Demand Forecasting Data Science Project to develop a machine learning model that forecasts inventory demand accurately based on historical sales data.


2) Amazon

Amazon is an American multinational technology company headquartered in Seattle, USA. It started as an online bookseller, but today it focuses on eCommerce, cloud computing, digital streaming, and artificial intelligence. It hosts an estimated 1,000,000,000 gigabytes of data across more than 1,400,000 servers. Through its constant innovation in data science and big data, Amazon stays ahead in understanding its customers. Here are a few data analytics case study examples from Amazon:

i) Recommendation Systems

Data science models help Amazon understand its customers' needs and recommend products to them before they even search for a product; these models use collaborative filtering. Amazon uses data on 152 million customer purchases to help users decide which products to buy, and the company generates 35% of its annual sales through its recommendation-based system (RBS).

Here is a Recommender System Project to help you build a recommendation system using collaborative filtering. 
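For a feel of the underlying idea, here is a tiny item-based collaborative filtering sketch (purely illustrative, not Amazon's system) that scores a user's unrated products by their similarity to products the user has already rated; the ratings matrix is invented.

```python
import numpy as np
import pandas as pd

# Rows are users, columns are products, values are ratings (0 = not rated).
ratings = pd.DataFrame(
    [[5, 4, 0, 1], [4, 0, 4, 1], [1, 1, 5, 4], [0, 1, 4, 5]],
    index=["u1", "u2", "u3", "u4"],
    columns=["book", "e_reader", "headphones", "speaker"],
)

def item_similarity(r):
    """Cosine similarity between the rating columns of each pair of items."""
    m = r.to_numpy(dtype=float)
    norms = np.linalg.norm(m, axis=0)
    return pd.DataFrame(m.T @ m / np.outer(norms, norms), index=r.columns, columns=r.columns)

def recommend(user, r, top_n=2):
    """Score the user's unrated items by similarity to the items they rated."""
    sim = item_similarity(r)
    user_ratings = r.loc[user]
    scores = sim.mul(user_ratings, axis=0).sum() / sim.mul(user_ratings > 0, axis=0).sum()
    return scores[user_ratings == 0].sort_values(ascending=False).head(top_n)

print(recommend("u1", ratings))  # suggests what u1 hasn't rated yet
```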

ii) Retail Price Optimization

Amazon product prices are optimized using a predictive model that determines the best price so that users do not refuse to buy the product because of its price. The model carefully determines the optimal price by considering the customers' likelihood of purchasing the product and how the price will affect their future buying patterns. The price of a product is determined according to your activity on the website, competitors' pricing, product availability, item preferences, order history, expected profit margin, and other factors.

Check Out this Retail Price Optimization Project to build a Dynamic Pricing Model.

iii) Fraud Detection

Being a significant eCommerce business, Amazon remains at high risk of retail fraud. As a preemptive measure, the company collects historical and real-time data for every order and uses machine learning algorithms to find transactions with a higher probability of being fraudulent. This proactive measure has also helped the company restrict customers with an excessive number of product returns.

You can look at this Credit Card Fraud Detection Project to implement a fraud detection model to classify fraudulent credit card transactions.


Let us explore data analytics case study examples in the entertainment industry.


3) Netflix

Netflix started as a DVD rental service in 1997 and has since expanded into the streaming business. Headquartered in Los Gatos, California, Netflix is the largest content streaming company in the world. Currently, Netflix has over 208 million paid subscribers worldwide, and with streaming now supported on thousands of smart devices, around 3 billion hours of content are watched on the platform every month. The secret to this massive growth and popularity is Netflix's advanced use of data analytics and recommendation systems to provide personalized and relevant content recommendations to its users. Netflix collects data from over 100 billion events every day. Here are a few examples of data analysis case studies applied at Netflix:

i) Personalized Recommendation System

Netflix uses over 1,300 recommendation clusters based on consumer viewing preferences to provide a personalized experience. The data that Netflix collects from its users includes viewing time, platform searches for keywords, and metadata related to content abandonment, such as when content is paused, rewound, or rewatched. Using this data, Netflix can predict what a viewer is likely to watch and give a personalized watchlist to each user. Some of the algorithms used by the Netflix recommendation system are Personalized Video Ranking, the Trending Now ranker, and the Continue Watching ranker.

ii) Content Development using Data Analytics

Netflix uses data science to analyze the behavior and viewing patterns of its users and recognize the themes and categories that the masses prefer to watch. This data is used to produce shows like The Umbrella Academy, Orange Is the New Black, and The Queen's Gambit. These shows may seem like huge risks, but they were backed by data analytics that assured Netflix they would succeed with its audience. Data analytics is helping Netflix come up with content that its viewers want to watch even before they know they want to watch it.

iii) Marketing Analytics for Campaigns

Netflix uses data analytics to find the right time to launch shows and ad campaigns for maximum impact on the target audience. Marketing analytics also helps it come up with different trailers and thumbnails for different groups of viewers. For example, the House of Cards Season 5 trailer with a giant American flag was launched during the American presidential elections, as it would resonate well with the audience.

Here is a Customer Segmentation Project using association rule mining to understand the primary grouping of customers based on various parameters.


4) Spotify

In a world where purchasing music is a thing of the past and streaming is the current trend, Spotify has emerged as one of the most popular streaming platforms. With 320 million monthly users, around 4 billion playlists, and approximately 2 million podcasts, Spotify leads the pack among well-known streaming platforms like Apple Music, Wynk, Songza, and Amazon Music. The success of Spotify has depended heavily on data analytics. By analyzing massive volumes of listener data, Spotify provides real-time, personalized services to its listeners. Most of Spotify's revenue comes from paid premium subscriptions. Here are some examples of the data analytics Spotify uses to provide enhanced services to its listeners:

i) Personalization of Content using Recommendation Systems

Spotify uses BaRT, described as Bayesian Additive Regression Trees, to generate music recommendations for its listeners in real time. BaRT ignores any song a user listens to for less than 30 seconds, and the model is retrained every day to provide updated recommendations. A patent recently granted to Spotify covers an AI application that identifies a user's musical tastes based on audio signals, gender, age, and accent to make better music recommendations.

Spotify creates daily playlists for its listeners based on their taste profiles, called 'Daily Mixes,' which contain songs the user has added to their playlists or songs by artists the user has included in their playlists. They also include new artists and songs that the user might be unfamiliar with but that might fit the playlist. Similar to these is the weekly 'Release Radar' playlist, which contains newly released songs from artists that the listener follows or has liked before.

ii) Targeted Marketing through Customer Segmentation

In addition to using listener data to enhance personalized song recommendations, Spotify uses this massive dataset for targeted ad campaigns and personalized service recommendations. Spotify uses ML models to analyze listener behavior and group listeners based on music preferences, age, gender, ethnicity, etc. These insights help it create ad campaigns for specific target audiences. One of its well-known ad campaigns was the meme-inspired ads aimed at potential target customers, which were a huge success globally.

iii) CNNs for Classification of Songs and Audio Tracks

Spotify builds audio models to evaluate songs and tracks, which helps it develop better playlists and recommendations for its users. These allow Spotify to filter new tracks based on their lyrics and rhythms and recommend them to users who like similar tracks (collaborative filtering). Spotify also uses NLP (natural language processing) to scan articles and blogs and analyze the words used to describe songs and artists. These analytical insights help group and identify similar artists and songs, which can then be leveraged to build playlists.

Here is a Music Recommender System Project for you to start learning. We have also listed another music recommendations dataset for you to use in your projects: Dataset1. You can use this dataset of Spotify metadata to classify songs based on artist, mood, and liveliness. Plot histograms and heatmaps to get a better understanding of the dataset, and use classification algorithms like logistic regression and SVM, along with principal component analysis, to generate valuable insights from it.


Below you will find case studies for data analytics in the travel and tourism industry.

5) Airbnb

Airbnb was born in 2007 in San Francisco and has since grown to 4 million hosts and 5.6 million listings worldwide, which have welcomed more than 1 billion guest arrivals in almost every country across the globe. Airbnb is active in every country on the planet except Iran, Sudan, Syria, and North Korea, which works out to around 97.95% of the world. Treating data as the voice of its customers, Airbnb uses its large volume of customer reviews and host inputs to understand trends across communities, rate user experiences, and make informed decisions to build a better business model. The data scientists at Airbnb are developing exciting new solutions to boost the business and find the best match between customers and hosts. Airbnb's data servers serve approximately 10 million requests a day and process around one million search queries. By creating a perfect match between guests and hosts, Airbnb offers personalized services for a supreme customer experience.

i) Recommendation Systems and Search Ranking Algorithms

Airbnb helps people find 'local experiences' in a place with the help of search algorithms that make searches and listings precise. Airbnb uses a 'listing quality score' to find homes based on the proximity to the searched location and uses previous guest reviews. Airbnb uses deep neural networks to build models that take the guest's earlier stays into account and area information to find a perfect match. The search algorithms are optimized based on guest and host preferences, rankings, pricing, and availability to understand users’ needs and provide the best match possible.

ii) Natural Language Processing for Review Analysis

Airbnb characterizes data as the voice of its customers. Customer and host reviews give a direct insight into the experience, but star ratings alone are not a good way to understand that experience quantitatively. Hence, Airbnb uses natural language processing to understand reviews and the sentiments behind them. The NLP models are developed using convolutional neural networks.

Practice this Sentiment Analysis Project for analyzing product reviews to understand the basic concepts of natural language processing.

iii) Smart Pricing using Predictive Analytics

Many Airbnb hosts use the service as a source of supplementary income. The vacation homes and guest houses rented to customers raise local community earnings, as Airbnb guests stay 2.4 times longer and spend approximately 2.3 times as much money as a typical hotel guest, which has a significant positive impact on the local neighborhood. Airbnb uses predictive analytics to predict listing prices and help hosts set a competitive and optimal price. The overall profitability of an Airbnb host depends on factors like the time invested by the host and responsiveness to changing demand across seasons. The factors that drive real-time smart pricing include the location of the listing, its proximity to transport options, the season, and the amenities available in the neighborhood.

Here is a Price Prediction Project to help you understand the concept of predictive analytics, which is common in case studies for data analytics.
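As a minimal sketch of this kind of price model (invented data, not Airbnb's smart-pricing system), a simple regression on a couple of listing features might look like this:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical listings: distance to city center (km), bedrooms, nightly price (USD).
listings = pd.DataFrame({
    "distance_km": [1.2, 3.5, 0.8, 6.0, 2.2, 4.1, 0.5, 7.3],
    "bedrooms":    [1,   2,   1,   3,   2,   1,   2,   3],
    "price":       [120, 95,  150, 110, 105, 70,  160, 90],
})

X, y = listings[["distance_km", "bedrooms"]], listings["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.predict(pd.DataFrame({"distance_km": [2.0], "bedrooms": [2]})))
```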

6) Uber

Uber is the biggest taxi service provider in the world. As of December 2018, Uber had 91 million monthly active consumers and 3.8 million drivers, and it completes 14 million trips each day. Uber uses data analytics and big-data-driven technologies to optimize its business processes and provide enhanced customer service. The data science team at Uber is constantly exploring new technologies to provide better service. Machine learning and data analytics help Uber make data-driven decisions that enable benefits like ride-sharing, dynamic price surges, better customer support, and demand forecasting. Here are some of the real-world data science projects used at Uber:

i) Dynamic Pricing for Price Surges and Demand Forecasting

Uber's prices change at peak hours based on demand. Uber uses surge pricing to encourage more cab drivers to sign in with the company and meet passenger demand. When prices increase, both the driver and the passenger are informed about the surge. Uber uses a patented predictive model for price surging called 'Geosurge,' which is based on the demand for rides and the location.

ii) One-Click Chat

Uber has developed a machine learning and natural language processing solution called one-click chat, or OCC, for coordination between drivers and riders. This feature anticipates responses to commonly asked questions, making it easy for drivers to respond to customer messages with the click of just one button. One-Click Chat is built on Uber's machine learning platform, Michelangelo, to perform NLP on rider chat messages and generate appropriate responses.

iii) Customer Retention

Failure to meet customer demand for cabs could lead users to opt for other services. Uber uses machine learning models to bridge this demand-supply gap: by using prediction models to forecast the demand in any location, Uber retains its customers. Uber also uses a tier-based reward system, which segments customers into different levels based on usage; the higher the level a user achieves, the better the perks. Uber also provides personalized destination suggestions based on the user's history and frequently traveled destinations.

You can take a look at this Python Chatbot Project and build a simple chatbot application to better understand the techniques used for natural language processing. You can also practice building a demand forecasting model with this project using time series analysis, or look at this project, which uses time series forecasting and clustering on a dataset containing geospatial data to forecast customer demand for Ola rides.


7) LinkedIn 

LinkedIn is the largest professional social networking site with nearly 800 million members in more than 200 countries worldwide. Almost 40% of the users access LinkedIn daily, clocking around 1 billion interactions per month. The data science team at LinkedIn works with this massive pool of data to generate insights to build strategies, apply algorithms and statistical inferences to optimize engineering solutions, and help the company achieve its goals. Here are some of the real world data science projects at LinkedIn:

i) LinkedIn Recruiter Implement Search Algorithms and Recommendation Systems

LinkedIn Recruiter helps recruiters build and manage a talent pool to optimize the chances of hiring candidates successfully. This sophisticated product works on search and recommendation engines. LinkedIn Recruiter handles complex queries and filters on a constantly growing dataset, and the results delivered have to be relevant and specific. The initial search model was based on linear regression but was eventually upgraded to gradient boosted decision trees to capture non-linear correlations in the dataset. In addition to these models, LinkedIn Recruiter also uses a Generalized Linear Mixed model to improve the results of prediction problems and give personalized results.

ii) Recommendation Systems Personalized for News Feed

The LinkedIn news feed is the heart and soul of the professional community. A member's newsfeed is a place to discover conversations among connections, career news, posts, suggestions, photos, and videos. Every time a member visits LinkedIn, machine learning algorithms identify the best exchanges to be displayed on the feed by sorting through posts and ranking the most relevant results on top. The algorithms help LinkedIn understand member preferences and help provide personalized news feeds. The algorithms used include logistic regression, gradient boosted decision trees and neural networks for recommendation systems.

iii) CNNs to Detect Inappropriate Content

Providing a professional space where people can trust and express themselves in a safe community has been a critical goal at LinkedIn. LinkedIn has invested heavily in building solutions to detect fake accounts and abusive behavior on its platform. Any form of spam, harassment, or inappropriate content is immediately flagged and taken down; these can range from profanity to advertisements for illegal services. LinkedIn uses a convolutional neural network based machine learning model for this. The classifier trains on a dataset containing accounts labeled as either "inappropriate" or "appropriate." The inappropriate list consists of accounts whose content contains "blocklisted" phrases or words, plus a small portion of manually reviewed accounts reported by the user community.

Here is a Text Classification Project to help you understand the NLP basics of text classification. You can find a news recommendation system dataset to help you build a personalized news recommender system. You can also use this dataset to build a classifier using logistic regression, Naive Bayes, or neural networks to classify toxic comments.
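A minimal sketch of such a toxic-comment classifier, using TF-IDF features and logistic regression on a tiny invented sample (the linked datasets are of course far larger):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up training sample: 1 = inappropriate, 0 = acceptable.
comments = [
    "you are a complete idiot",
    "thanks for sharing, this was really helpful",
    "nobody wants to read your garbage opinions",
    "great analysis, I learned a lot",
]
labels = [1, 0, 1, 0]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(comments, labels)

print(classifier.predict(["what a helpful and thoughtful post"]))
print(classifier.predict(["your opinions are garbage"]))
```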


8) Pfizer

Pfizer is a multinational pharmaceutical company headquartered in New York, USA. It is one of the largest pharmaceutical companies globally, known for developing a wide range of medicines and vaccines in disciplines like immunology, oncology, cardiology, and neurology. Pfizer became a household name in 2020 when its COVID-19 vaccine was the first to receive emergency use authorization from the FDA. In early November 2021, the CDC recommended the Pfizer vaccine for kids aged 5 to 11. Pfizer has been using machine learning and artificial intelligence to develop drugs and streamline trials, which played a massive role in developing and deploying the COVID-19 vaccine. Here are a few data analytics case studies from Pfizer:

i) Identifying Patients for Clinical Trials

Artificial intelligence and machine learning are used to streamline and optimize clinical trials and increase their efficiency. Natural language processing and exploratory data analysis of patient records can help identify suitable patients for clinical trials, for example patients with distinct symptoms. They can also help examine the interactions of potential trial members' specific biomarkers and predict drug interactions and side effects, which helps avoid complications. Pfizer's AI implementation helped rapidly identify signals within the noise of millions of data points across its 44,000-candidate COVID-19 clinical trial.

ii) Supply Chain and Manufacturing

Data science and machine learning techniques help pharmaceutical companies better forecast demand for vaccines and drugs and distribute them efficiently. Machine learning models can help identify efficient supply systems by automating and optimizing the production steps. These will help supply drugs customized to small pools of patients in specific gene pools. Pfizer uses Machine learning to predict the maintenance cost of equipment used. Predictive maintenance using AI is the next big step for Pharmaceutical companies to reduce costs.

iii) Drug Development

Computer simulations of proteins, tests of their interactions, and yield analysis help researchers develop and test drugs more efficiently. In 2016, Watson Health and Pfizer announced a collaboration to utilize IBM Watson for Drug Discovery to help accelerate Pfizer's research in immuno-oncology, an approach to cancer treatment that uses the body's immune system to help fight cancer. Deep learning models have recently been used for bioactivity and synthesis prediction for drugs and vaccines, in addition to molecular design. Deep learning has been a revolutionary technique for drug discovery, as it factors in everything from new applications of medications to possible toxic reactions, which can save millions in drug trials.

You can create a Machine learning model to predict molecular activity to help design medicine using this dataset . You may build a CNN or a Deep neural network for this data analyst case study project.


9) Shell Data Analyst Case Study Project

Shell is a global group of energy and petrochemical companies with over 80,000 employees in around 70 countries. Shell uses advanced technologies and innovations to help build a sustainable energy future. Shell is going through a significant transition, as the world needs more and cleaner energy solutions, and it aims to be a clean energy company by 2050. This requires substantial changes in the way energy is used and produced. Digital technologies, including AI and machine learning, play an essential role in this transformation, enabling more efficient exploration and energy production, more reliable manufacturing, more nimble trading, and a personalized customer experience. Using AI in various phases of the organization will help Shell achieve this goal and stay competitive in the market. Here are a few data analytics case studies from the petrochemical industry:

i) Precision Drilling

Shell is involved in the entire oil and gas supply chain, ranging from mining hydrocarbons to refining the fuel and retailing it to customers. Recently, Shell has applied reinforcement learning to control the drilling equipment used in mining. Reinforcement learning works on a reward-based system tied to the outcome of the AI model. The algorithm is designed to guide the drills as they move through the subsurface, based on historical data from drilling records, including information such as the size of drill bits, temperatures, pressures, and knowledge of seismic activity. This model helps the human operator understand the environment better, leading to better and faster results with less damage to the machinery used.

ii) Efficient Charging Terminals

Due to climate change, governments have encouraged people to switch to electric vehicles to reduce carbon dioxide emissions. However, the lack of public charging terminals has deterred people from switching to electric cars. Shell uses AI to monitor and predict the demand for terminals so that supply can be provided efficiently. Multiple vehicles charging from a single terminal may create a considerable grid load, and predictions of demand can help make this process more efficient.

iii) Monitoring Service and Charging Stations

Another Shell initiative, trialed in Thailand and Singapore, is the use of computer vision cameras that can watch out for potentially hazardous activities, like lighting a cigarette in the vicinity of the pumps while refueling. The model is built to process the content of the captured images and label and classify it. The algorithm can then alert the staff and hence reduce the risk of fires. The model could further be trained to detect rash driving or theft in the future.

Here is a project to help you understand multiclass image classification. You can also use the Hourly Energy Consumption Dataset to build an energy consumption prediction model, for example using time series features with XGBoost.
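A minimal sketch of that approach, using simple lag features and XGBoost on an invented hourly series (the column names and parameters are assumptions, not the dataset's actual schema):

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

# Fake hourly consumption series standing in for the real dataset.
hours = pd.date_range("2024-01-01", periods=500, freq="h")
consumption = pd.Series(
    50 + 10 * np.sin(np.arange(500) * 2 * np.pi / 24) + np.random.normal(0, 2, 500),
    index=hours,
    name="megawatts",
)

# Turn the series into a supervised learning problem with simple lag features.
frame = pd.DataFrame({
    "lag_1": consumption.shift(1),    # consumption one hour ago
    "lag_24": consumption.shift(24),  # consumption at the same hour yesterday
    "hour": consumption.index.hour,
    "target": consumption,
}).dropna()

train, test = frame.iloc[:-48], frame.iloc[-48:]  # hold out the last two days
model = XGBRegressor(n_estimators=200, max_depth=3)
model.fit(train[["lag_1", "lag_24", "hour"]], train["target"])
print(model.predict(test[["lag_1", "lag_24", "hour"]])[:5])
```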

10) Zomato Case Study on Data Analytics

Zomato was founded in 2010 and is currently one of the most well-known food tech companies. Zomato offers services like restaurant discovery, home delivery, online table reservation, and online payments for dining out. Zomato partners with restaurants to provide tools to acquire more customers while also providing delivery services and easy procurement of ingredients and kitchen supplies. Currently, Zomato has over 2 lakh restaurant partners and around 1 lakh delivery partners, and it has completed over ten crore delivery orders to date. Zomato uses ML and AI to boost its business growth, drawing on the massive amount of data collected over the years from food orders and user consumption patterns. Here are a few examples of data analyst case study projects developed by the data scientists at Zomato:

i) Personalized Recommendation System for Homepage

Zomato uses data analytics to create personalized homepages for its users. Zomato uses data science to provide order personalization, like giving recommendations to the customers for specific cuisines, locations, prices, brands, etc. Restaurant recommendations are made based on a customer's past purchases, browsing history, and what other similar customers in the vicinity are ordering. This personalized recommendation system has led to a 15% improvement in order conversions and click-through rates for Zomato. 

You can use the Restaurant Recommendation Dataset to build a restaurant recommendation system to predict what restaurants customers are most likely to order from, given the customer location, restaurant information, and customer order history.

ii) Analyzing Customer Sentiment

Zomato uses natural language processing and machine learning to understand customer sentiment from social media posts and customer reviews. These help the company gauge the inclination of its customer base towards the brand. Deep learning models analyze the sentiment of brand mentions on social networking sites like Twitter, Instagram, LinkedIn, and Facebook. These analytics give the company insights that help it build the brand and understand its target audience.

iii) Predicting Food Preparation Time (FPT)

Food preparation time is an essential variable in the estimated delivery time of an order placed through Zomato. It depends on numerous factors, like the number of dishes ordered, the time of day, footfall in the restaurant, the day of the week, etc. Accurately predicting the food preparation time allows a better prediction of the estimated delivery time, which makes delivery partners less likely to breach it. Zomato uses a Bidirectional LSTM-based deep learning model that considers all these features and predicts the food preparation time for each order in real time.
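As a rough illustration of what a Bidirectional LSTM regressor looks like in code (a toy sketch trained on random stand-in data, not Zomato's model), each order is represented here by a short sequence of restaurant-state snapshots:

```python
import numpy as np
import tensorflow as tf

# Hypothetical input: per order, 12 recent snapshots of 3 restaurant features
# (e.g., queue length, dishes in progress, hour of day).
sequence_length, n_features = 12, 3

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(sequence_length, n_features)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1),  # predicted food preparation time in minutes
])
model.compile(optimizer="adam", loss="mae")

X = np.random.rand(256, sequence_length, n_features)  # stand-in training data
y = np.random.uniform(5, 40, size=(256, 1))
model.fit(X, y, epochs=2, verbose=0)
print(model.predict(X[:1], verbose=0))
```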

Data scientists are companies' secret weapons when it comes to analyzing customer sentiment and behavior and leveraging those insights to drive conversion, loyalty, and profits. These 10 data science case study projects with examples and solutions show you how various organizations use data science technologies to succeed and stay at the top of their field! To summarize, data science has not only accelerated the performance of companies but has also made it possible to manage and sustain that performance with ease.

FAQs on Data Analysis Case Studies

What is a case study in data science?

A case study in data science is an in-depth analysis of a real-world problem using data-driven approaches. It involves collecting, cleaning, and analyzing data to extract insights and solve challenges, offering practical insights into how data science techniques can address complex issues across various industries.

How do you prepare a data science case study?

To create a data science case study, identify a relevant problem, define objectives, and gather suitable data. Clean and preprocess the data, perform exploratory data analysis, and apply appropriate algorithms for analysis. Summarize findings, visualize results, and provide actionable recommendations, showcasing the problem-solving potential of data science techniques.


About the Author


ProjectPro is the only online platform designed to help professionals gain practical, hands-on experience in big data, data engineering, data science, and machine learning related technologies, offering over 270 reusable project templates in data science and big data with step-by-step walkthroughs.


Using Python for Data Analysis

Table of Contents

  • Understanding the Need for a Data Analysis Workflow
  • Setting Your Objectives
  • Reading Data From CSV Files
  • Reading Data From Other Sources
  • Creating Meaningful Column Names
  • Dealing With Missing Data
  • Handling Financial Columns
  • Correcting Invalid Data Types
  • Fixing Inconsistencies in Data
  • Correcting Spelling Errors
  • Checking for Invalid Outliers
  • Removing Duplicate Data
  • Storing Your Cleansed Data
  • Performing a Regression Analysis
  • Investigating a Statistical Distribution
  • Finding No Relationship
  • Communicating Your Findings
  • Resolving an Anomaly

Data analysis is a broad term that covers a wide range of techniques that enable you to reveal any insights and relationships that may exist within raw data. As you might expect, Python lends itself readily to data analysis. Once Python has analyzed your data, you can then use your findings to make good business decisions, improve procedures, and even make informed predictions based on what you’ve discovered.

In this tutorial, you’ll:

  • Understand the need for a sound data analysis workflow
  • Understand the different stages of a data analysis workflow
  • Learn how you can use Python for data analysis

Before you start, you should familiarize yourself with Jupyter Notebook , a popular tool for data analysis. Alternatively, JupyterLab will give you an enhanced notebook experience. You might also like to learn how a pandas DataFrame stores its data. Knowing the difference between a DataFrame and a pandas Series will also prove useful.

Get Your Code: Click here to download the free data files and sample code for your mission into data analysis with Python.

In this tutorial, you’ll use a file named james_bond_data.csv . This is a doctored version of the free James Bond Movie Dataset . The james_bond_data.csv file contains a subset of the original data with some of the records altered to make them suitable for this tutorial. You’ll find it in the downloadable materials. Once you have your data file, you’re ready to begin your first mission into data analysis.

Data analysis is a very popular field and can involve performing many different tasks of varying complexity. Which specific analysis steps you perform will depend on which dataset you’re analyzing and what information you hope to glean. To overcome these scope and complexity issues, you need to take a strategic approach when performing your analysis. This is where a data analysis workflow can help you.

A data analysis workflow is a process that provides a set of steps for your analysis team to follow when analyzing data. The implementation of each of these steps will vary depending on the nature of your analysis, but following an agreed-upon workflow allows everyone involved to know what needs to happen and to see how the project is progressing.

Using a workflow also helps futureproof your analysis methodology. By following the defined set of steps, your efforts become systematic, which minimizes the possibility that you’ll make mistakes or miss something. Furthermore, when you carefully document your work, you can reapply your procedures against future data as it becomes available. Data analysis workflows therefore also provide repeatability and scalability.

There’s no single data workflow process that suits every analysis, nor is there universal terminology for the procedures used within it. To provide a structure for the rest of this tutorial, the diagram below illustrates the stages that you’ll commonly find in most workflows:

diagram of a data analysis workflow with iterations

The solid arrows show the standard data analysis workflow that you’ll work through to learn what happens at each stage. The dashed arrows indicate where you may need to carry out some of the individual steps several times depending upon the success of your analysis. Indeed, you may even have to repeat the entire process should your first analysis reveal something interesting that demands further attention.

Now that you have an understanding of the need for a data analysis workflow, you’ll work through its steps and perform an analysis of movie data. The movies that you’ll analyze all relate to the British secret agent Bond … James Bond.

The very first workflow step in data analysis is to carefully but clearly define your objectives. It’s vitally important for you and your analysis team to be clear on what exactly you’re all trying to achieve. This step doesn’t involve any programming but is every bit as important because, without an understanding of where you want to go, you’re unlikely to ever get there.

The objectives of your data analysis will vary depending on what you’re analyzing. Your team leader may want to know why a new product hasn’t sold, or perhaps your government wants information about a clinical test of a new medical drug. You may even be asked to make investment recommendations based on the past results of a particular financial instrument. Regardless, you must still be clear on your objectives. These define your scope.

In this tutorial, you’ll gain experience in data analysis by having some fun with the James Bond movie dataset mentioned earlier. What are your objectives? Now pay attention, 007 :

  • Is there any relationship between the Rotten Tomatoes ratings and those from IMDb?
  • Are there any insights to be gleaned from analyzing the lengths of the movies?
  • Is there a relationship between the number of enemies James Bond has killed and the user ratings of the movie in which they were killed?

Now that you’ve been briefed on your mission, it’s time to get out into the field and see what intelligence you can uncover.

Acquiring Your Data

Once you’ve established your objectives, your next step is to think about what data you’ll need to achieve them. Hopefully, this data will be readily available, but you may have to work hard to get it. You may need to extract it from the data storage systems within an organization or collect survey data . Regardless, you’ll somehow need to get the data.

In this case, you’re in luck. When your bosses briefed you on your objectives, they also gave you the data in the james_bond_data.csv file. You must now spend some time becoming familiar with what you have in front of you. During the briefing, you made some notes on the content of this file:

  • The release date of the movie
  • The title of the movie
  • The actor playing the title role
  • The manufacturer of James Bond’s car
  • The movie’s gross US earnings
  • The movie’s gross worldwide earnings
  • The movie’s budget, in thousands of US dollars
  • The running time of the movie
  • The average user rating from IMDb
  • The average user rating from Rotten Tomatoes
  • The number of martinis that Bond drank in the movie

As you can see, you have quite a variety of data. You won’t need all of it to meet your objectives, but you can think more about this later. For now, you’ll concentrate on getting the data out of the file and into Python for cleansing and analysis.

Remember, also, it’s considered best practice to retain the original file in case you need it in the future. So you decide to create a second data file with a cleansed version of the data. This will also simplify any future analysis that may arise as a consequence of your mission.

You can obtain your data in a variety of file formats. One of the most common is the comma-separated values (CSV) file. This is a text file that separates each piece of data with commas. The first row is usually a header row that defines the file’s content, with the subsequent rows containing the actual data. CSV files have been in use for several years and remain popular because several data storage programs use them.

Because james_bond_data.csv is a text file, you can open it in any text editor. The screenshot below shows it opened in Notepad :

image of a raw csv file

As you can see, a CSV file isn’t a pleasant read. Fortunately, you rarely need to read them in their raw form.

When you need to analyze data, Python’s pandas library is a popular option. To install pandas in a Jupyter Notebook, add a new code cell and type !python -m pip install pandas . When you run the cell, you’ll install the library. If you’re working in the command line, then you use the same command, only without the exclamation point ( ! ).

With pandas installed, you can now use it to read your data file into a pandas DataFrame. The code below will do this for you:
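```python
import pandas as pd

james_bond_data = pd.read_csv("james_bond_data.csv").convert_dtypes()
```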

Firstly, you import the pandas library into your program. It’s standard practice to alias pandas as pd for code to use as a reference. Next, you use the read_csv() function to read your data file into a DataFrame named james_bond_data . This will not only read your file but also take care of sorting out the headings from the data and indexing each record.

While using pd.read_csv() alone will work, you can also use .convert_dtypes() . This good practice allows pandas to optimize the data types that it uses in the DataFrame.

Suppose your CSV file contained a column of integers with missing values. By default, these will be assigned the numpy.NaN floating-point constant. This forces pandas to assign the column a float64 data type. Any integers in the column are then cast as floats for consistency.

These floating-point values could cause other undesirable floats to appear in the results of subsequent calculations. Similarly, if the original numbers were, for example, ages, then having them cast into floats probably wouldn’t be what you want.

Your use of .convert_dtypes() means that columns will be assigned one of the extension data types . Any integer columns, which were of type int , will now become the new Int64 type. This occurs because pandas.NA represents the original missing values and can be read as an Int64 . Similarly, text columns become string types, rather than the more generic object . Incidentally, floats become the new Float64 extension type, with a capital F.

After creating the DataFrame, you then decide to take a quick look at it to make sure the read has worked as you expected it to. A quick way to do this is to use .head() . This function will display the first five records for you by default, but you can customize .head() to display any number you like by passing an integer to it. Here, you decide to view the default five records:
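```python
james_bond_data.head()
```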

You now have a pandas DataFrame containing the records along with their headings and a numerical index on the left-hand side. If you’re using a Jupyter Notebook, then the output will look like this:

dataframe showing initial view of data

As you can see, the Jupyter Notebook output is even more readable. However, both are much better than the CSV file that you started with.

Although CSV is a popular data file format, it isn’t particularly good. Lack of format standardization means that some CSV files contain multiple header and footer rows, while others contain neither. Also, the lack of a defined date format and the use of different separator and delimiter characters within and between data can cause issues when you read it.

Fortunately, pandas allows you to read many other formats, like JSON and Excel . It also provides web-scraping capabilities to allow you to read tables from websites. One particularly interesting and relatively new format is the column-oriented Apache Parquet file format used for handling bulk data. Parquet files are also cost-effective when working with cloud storage systems because of their compression ability.

Although having the ability to read basic CSV files is sufficient for this analysis, the downloads section provides some alternative file formats containing the same data as james_bond_data.csv . Each file is named james_bond_data , with a file-specific extension. Why not see if you can figure out how to read each of them into a DataFrame in the same way as you did with your CSV file?

If you want an additional challenge, then try scraping the Books, by publication sequence , table from Wikipedia. If you succeed, then you’ll have gained some valuable knowledge, and M will be very pleased with you .

For solutions to these challenges, expand the following collapsible sections:

How to Read a JSON File Show/Hide

To read in a JSON file, you use pd.read_json() :
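```python
james_bond_data = pd.read_json("james_bond_data.json").convert_dtypes()
```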

As you can see, you only need to specify the JSON file that you want to read. You can also specify some interesting formatting and data conversion options if you need to. The docs page will tell you more.

How to Read an Excel File Show/Hide

Before this will work, you must install the openpyxl library. You use the command !python -m pip install openpyxl from within your Jupyter Notebook or python -m pip install openpyxl at the terminal. To read your Excel file, you then use .read_excel() :
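```python
# Assuming the Excel version of the data is named james_bond_data.xlsx:
james_bond_data = pd.read_excel("james_bond_data.xlsx").convert_dtypes()
```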

As before, you only need to specify the filename. In cases where you’re reading from one of several worksheets, you must also specify the worksheet name by using the sheet_name argument. The docs page will tell you more.

How to Read a Parquet File Show/Hide

Before this will work, you must install a serialization engine such as pyarrow . To do this, you use the command !python -m pip install pyarrow from within your Jupyter Notebook or python -m pip install pyarrow at the terminal. To read your parquet file, you then use .read_parquet() :
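```python
james_bond_data = pd.read_parquet("james_bond_data.parquet").convert_dtypes()
```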

As before, you only need to specify the filename. The docs page will tell you more, including how to use alternative serialization engines.

How to Web Scrape an HTML Table Show/Hide

Before this will work, you must install the lxml library to allow you to read HTML files. To do this, you use the command !python -m pip install lxml from within your Jupyter Notebook or python -m pip install lxml at the terminal. To read, or scrape , an HTML table, you use read_html() :
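One possible solution, assuming the table lives on the Wikipedia page that lists the James Bond novels:

```python
# The exact page isn't specified here, so this URL is an assumption.
james_bond_tables = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_James_Bond_novels_and_short_stories"
)
james_bond_books = james_bond_tables[1].convert_dtypes()
```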

This time, you pass the URL of the website that you wish to scrape. The read_html() function will return a list of the tables on the web page. The one that interests you in this example is at list index 1 , but finding the one you want may require a certain amount of trial and error. The docs page will tell you more.

Now that you have your data, you might think it’s time to dive deep into it and start your analysis. While this is tempting, you can’t do it just yet. This is because your data might not yet be analyzable. In the next step, you’ll fix this.

Cleansing Your Data With Python

The data cleansing stage of the data analysis workflow is often the stage that takes the longest, particularly when there’s a large volume of data to be analyzed. It’s at this stage that you must check over your data to make sure that it’s free from poorly formatted, incorrect, duplicated, or incomplete data. Unless you have quality data to analyze, your Python data analysis code is highly unlikely to return quality results.

While you must check and re-check your data to resolve as many problems as possible before the analysis, you must also accept that additional problems could appear during your analysis. That is why there’s a possible iteration between the data cleansing and analysis stages in the diagram that you saw earlier .

The traditional way to cleanse data is by applying pandas methods separately until the data has been cleansed. While this works, it means that you create a set of intermediate DataFrame versions, each with a separate fix applied. However, this creates reproducibility problems with future cleansings because you must reapply each fix in strict order.

A better approach is for you to cleanse data by repeatedly updating the same DataFrame in memory using a single piece of code. When writing data cleansing code, you should build it up in increments and test it after writing each increment. Then, once you’ve written enough to cleanse your data fully, you’ll have a highly reusable script for cleansing of any future data that you may need to analyze. This is the approach that you’ll adopt here.

When you extract data from some systems, the column names may not be as meaningful as you’d like. It’s good practice to make sure the columns in your DataFrame are sensibly named. To keep them readable within code, you should adopt the Python variable-naming convention of using all lowercase characters, with multiple words being separated by underscores. This forces your analysis code to use these names and makes it more readable as a result.

To rename the columns in a DataFrame, you use .rename() . You pass it a Python dictionary whose keys are the original column names and whose values are the replacement names:
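The raw file's original headings aren't reproduced in this text-only version, so the dictionary keys below are illustrative placeholders; the values are the tidier names used throughout the rest of the analysis:

```python
data = james_bond_data.rename(
    columns={
        "Release": "release",              # placeholder original heading
        "US_Gross": "income_usa",
        "World_Gross": "income_world",
        "Budget ($ 000s)": "movie_budget",
        "Film_Length": "film_length",
        "Avg_User_IMDB": "imdb",
        "Avg_User_Rtn_Tom": "rotten_tomatoes",
        # ...plus similar entries for the title, actor, car, and martini columns.
    }
)
```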

In the code above, you’ve replaced each of the column names with something more Pythonic. This returns a fresh DataFrame that’s referenced using the data variable, not the original DataFrame referenced by james_bond_data . The data DataFrame is the one that you’ll work with from this point forward.

Note: When analyzing data, it’s good practice to retain the raw data in its original form. This is necessary to ensure that others can reproduce your analysis to confirm its validity. Remember, it’s the raw data, not your cleansed version, that provides the real proof of your conclusions.

As with all stages in data cleansing, it’s important to test that your code has worked as you expect:
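```python
data.columns
```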

To quickly view the column labels in your DataFrame, you use the DataFrame’s .columns property. As you can see, you’ve successfully renamed the columns. You’re now ready to move on and cleanse the actual data itself.

As a starting point, you can quickly check to see if anything is missing within your data. The DataFrame’s .info() method allows you to quickly do this:
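```python
data.info()
```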

When this method runs, you see a very concise summary of the DataFrame. The .info() method has revealed that there’s missing data. The RangeIndex line near the top of the output tells you that there have been twenty-seven rows of data read into the DataFrame. However, the imdb and rotten_tomatoes columns contain only twenty-six non-null values each. Each of these columns has one piece of missing data.

You may also have noticed that some data columns have incorrect data types. To begin with, you’ll concentrate on fixing missing data. You’ll deal with the data type issues afterward.

Before you can fix these columns, you need to see them. The code below will reveal them to you:
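```python
data.loc[data.isna().any(axis="columns")]
```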

To find rows with missing data, you can make use of the DataFrame’s .isna() method. This will analyze the data DataFrame and return a second, identically sized Boolean DataFrame that contains either True or False values, depending on whether or not the corresponding values in the data DataFrame contain <NA> or not.

Once you have this second Boolean DataFrame, you then use its .any(axis="columns") method to return a pandas Series that will contain True where rows in the second DataFrame have a True value, and False if they don’t. The True values in this Series indicate rows containing missing data, while the False values indicate where there’s no missing data.

At this point, you have a Boolean Series of values. To see the rows themselves, you can make use of the DataFrame’s .loc property. Although you usually use .loc to access subsets of rows and columns by their labels, you can also pass it your Boolean Series and get back a DataFrame containing only those rows corresponding to the True entries in the Series. These are the rows with missing data.

If you put all of this together, then you get data.loc[data.isna().any(axis="columns")] . As you can see, the output displays only one row that contains both <NA> values.

When you first saw only one row appear, you might have felt a bit shaken , but now you’re not stirred because you understand why.

JupyterLab: Nobody Does it Better Show/Hide

One of your aims is to produce code that you can reuse in the future. The previous piece of code is really only for locating duplicates and won’t be part of your final production code. If you’re working in a Jupyter Notebook, then you may be tempted to include code such as this. While this is necessary if you want to document everything that you’ve done, you’ll end up with a messy notebook that will be distracting for others to read.

If you’re working within a notebook in JupyterLab, then a good workflow tactic is to open a new console within JupyterLab against your notebook and run your test and exploratory code inside that console. You can copy any code that gives you the desired results to your Jupyter notebook, and you can discard any code that doesn’t do what you expected or that you don’t need.

To add a new console to your notebook, right-click anywhere on the running notebook and choose New Console for Notebook from the pop-up menu that appears. A new console will appear below your notebook. Type any code that you wish to experiment with into the console and tap Shift + Enter to run it. You’ll see the results appear above the code, allowing you to decide whether or not you wish to keep it.

Once your analysis is finished, you should reset and retest your entire Jupyter notebook from scratch. To do this, select Kernel → Restart Kernel → Clear Outputs of All Cells from the menu. This will reset your notebook’s kernel, removing all traces of the previous results. You can then rerun the code cells sequentially to verify that everything works correctly.

Two Jupyter notebooks are provided as part of the downloadable content, which you can get by clicking the link below:

The data_analysis_results.ipynb notebook contains a reusable version of the code for cleansing and analyzing the data, while the data_analysis_findings.ipynb notebook contains a log of the procedures used to arrive at these final results.

You can complete this tutorial using other Python environments, but Jupyter Notebook within JupyterLab is highly recommended.

To fix these errors, you need to update the data DataFrame. As you learned earlier, you’ll build up all changes temporarily in a DataFrame referenced by the data variable then write them to disk when they are all complete. You’ll now add some code to fix those <NA> values that you’ve discovered.

After doing some research , you find out the missing values are 7.1 and 6.8 , respectively. The code below will update each missing value correctly:
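```python
# A small DataFrame holding the two corrections, keyed by column and row index.
missing_data = pd.DataFrame(
    {"imdb": {10: 7.1}, "rotten_tomatoes": {10: 6.8}}
)
data = data.combine_first(missing_data)
```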

Here, you’ve chosen to define a DataFrame using a Python dictionary. The keys of the dictionary define its column headings, while its values define the data. Each value consists of a nested dictionary. The keys of this nested dictionary provide the row index, while the values provide the updates. The DataFrame looks like this:

Then when you call .combine_first() and pass it this DataFrame, the two missing values in the imdb and rotten_tomatoes columns in row 10 are replaced by 7.1 and 6.8 , respectively. Remember, you haven’t updated the original james_bond_data DataFrame. You’ve only changed the DataFrame referenced by the data variable.

You must now test your efforts. Go ahead and run data[data.isna().any(axis="columns")] to make sure no rows are returned. You should see an empty DataFrame.

Now you’ll fix the invalid data types. Without this, numerical analysis of your data is meaningless, if not impossible. To begin with, you’ll fix the currency columns.

The data.info() code that you ran earlier also revealed a subtler issue. The income_usa , income_world , movie_budget , and film_length columns all have data types of string . However, these should all be numeric types because strings are of little use for calculations. Similarly, the release_date column, which contains the release date, is also a string . This should be a date type.

First of all, you need to take a look at some of the data in each of the columns to learn what the problem is:
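One way to take that look, sketched here using the column names discussed in this section, is to select the problem columns and peek at the first few rows:

```python
# Show the first five rows of the columns flagged by data.info().
data[
    ["income_usa", "income_world", "movie_budget", "film_length", "release_date"]
].head()
```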

To access multiple columns, you pass a list of column names into the DataFrame’s [] operator. Although you could also use data.loc[] , using data[] alone is cleaner. Either option will return a DataFrame containing all the data from those columns. To keep things manageable, you use the .head() method to restrict the output to the first five records.

As you can see, the three financial columns each have dollar signs and comma separators, while the film_length column contains "mins" . You’ll need to remove all of this to use the remaining numbers in the analysis. These additional characters are why the data types are being misinterpreted as strings.

Although you could replace the $ sign in the entire DataFrame, this may remove it in places where you don’t want to. It’s safer if you remove it one column at a time. To do this, you can make excellent use of the .assign() method of a DataFrame. This can either add a new column to a DataFrame, or replace existing columns with updated values.

As a starting point, suppose you wanted to replace the $ symbols in the income_usa column of the data DataFrame that you’re creating.
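The sketch below shows this step on its own, rather than as part of the longer chained pipeline that the original notebook builds up. It follows the approach described in the next paragraphs:

```python
data = data.assign(
    income_usa=lambda df: (
        df["income_usa"]
        .replace("[$,]", "", regex=True)  # strip dollar signs and thousands separators
        .astype("Float64")                # convert the remaining digits to a nullable float
    ),
)
```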

To correct the income_usa column, you define its new data as a pandas Series and pass it into the data DataFrame’s .assign() method. This method will then either overwrite an existing column with a new Series or create a new column containing it. You define the name of the column to be updated or created as a named parameter that references the new data Series. In this case, you’ll pass a parameter named income_usa .

It’s best to create the new Series using a lambda function. The lambda function used in this example accepts the data DataFrame as its argument and then uses the .replace() method to remove the $ and comma separators from each value in the income_usa column. Finally, it converts the remaining digits, which are currently of type string , to Float64 .

To actually remove the $ symbol and commas, you pass the regular expression [$,] into .replace() . By enclosing both characters in [] , you’re specifying that you want to remove all instances of both. Then you define their replacements as "" . You also set the regex parameter to True to allow [$,] to be interpreted as a regular expression.

The result of the lambda function is a Series with no $ or comma separators. You then assign this Series to the variable income_usa . This causes the .assign() method to overwrite the existing income_usa column’s data with the cleansed updates.

Take another look at the above code, and you’ll see how this all fits together. You pass .assign() a parameter named income_usa , which references a lambda function that calculates a Series containing the updated content. You assign the Series that the lambda produces to a parameter named income_usa , which tells .assign() to update the existing income_usa column with the new values.

Now go ahead and run this code to remove the offending characters from the income_usa column. Don’t forget to test your work and verify that you’ve made the replacements. Also, remember to verify that the data type of income_usa is indeed Float64 .

Note: When using assign() , it’s also possible to pass in a column directly by using .assign(income_usa=data["income_usa"]...) . However, this causes problems if the income_usa column has been changed earlier in the pipeline. These changes won’t be available for the calculation of the updated data. By using a lambda function, you force the calculation of a new set of column data based on the most recent version of that data.

Of course, it isn’t only the income_usa column that you need to work on. You also need to do the same with the income_world and movie_budget columns. You can also achieve this using the same .assign() method. You can use it to create and assign as many columns as you like. You simply pass them in as separate named arguments.

Why not go ahead and see if you can write the code that removes the same two characters from the income_world and movie_budget columns? As before, don’t forget to verify that your code has worked as you expect, but remember to check the correct columns!

Note : Remember that the parameter names you use within .assign() to update columns must be valid Python identifiers. This was another reason why changing the column names the way you did at the beginning of your data cleansing was a good idea.

Once you’ve tried your hand at resolving the remaining issues with these columns, you can reveal the solution below:

Removing Remaining Currency Symbols and Separators

In the code below, you’ve used the earlier code but added a lambda to remove the remaining "$" and separator strings:

One lambda deals with the income_world data, while another deals with the movie_budget data. As you can see, all three lambdas work in the same way.

Once you’ve made these corrections, remember to test your code by using data.info() . You’ll see the financial figures are no longer string types, but Float64 numbers. To view the actual changes, you can use data.head() .

With the currency columns’ data types now corrected, you can fix the remaining invalid types.

Next, you must remove the "mins" string from the film length values, then convert the column to the integer type. This will allow you to analyze the values. To remove the offending "mins" text, you decide to use pandas’ .str.removesuffix() Series method. This allows you to remove the string passed to it from the right-hand side of the film_length column. You can then use .astype("Int64") to take care of the data type.

Using the above information, go ahead and see if you can update the film_length column using a lambda, and add it in as another parameter to the .assign() method.

You can reveal the solution below:

Removing a Substring

In the code below, you’re using the earlier code but adding a lambda to remove the "mins" string starting at line 22:

As you can see, the lambda uses .removesuffix() to update the film_length column by generating a new Series based on the data from the original film_length column, minus the "mins" string at the end of each value. To make sure you can use the column’s data as numbers, you use .astype("Int64") .

As before, test your code with the .info() and .head() methods that you used earlier. You should see the film_length column now has a more useful Int64 data type, and you’ve removed "mins" .

In addition to the problems with the financial data, you also noticed that the release_date column was being treated as a string. To convert its data into datetime format, you can use pd.to_datetime() .

To use to_datetime() , you pass the Series data["release_date"] into it, not forgetting to specify a format string to allow the date values to be interpreted correctly. Each date here is of the form June, 1962 , so in your code, you use %B followed by a comma and space to denote the position of the month names, then %Y to denote the four-digit years.

You also take the opportunity to create a new column in your DataFrame named release_year for storing the year portion of your updated data["release_date"] column data. The code to access this value is data["release_date"].dt.year . You figure that having each year separate may be useful for future analysis and even perhaps a future DataFrame index.

Using the above information, go ahead and see if you can update the release_date column to the correct type, and also create a new release_year column containing the year that each movie came out. As before, you can achieve both with .assign() and lambdas, and again as before, remember to test your efforts.

Adjusting Dates

In the code below, you use the earlier code with the addition of lambdas to update the release_date column’s data type and create a new column containing the release year:
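A standalone sketch of those two lambdas is shown below. Note that within a single .assign() call, the release_year lambda can already see the freshly converted release_date column:

```python
import pandas as pd

data = data.assign(
    release_date=lambda df: pd.to_datetime(df["release_date"], format="%B, %Y"),
    release_year=lambda df: df["release_date"].dt.year,  # uses the column converted just above
)
```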

As you can see, the lambda assigned to release_date updates the release_date column, while the release_year lambda creates a new column containing the year part of the dates from the release_date column.

As always, don’t forget to test your efforts.

Now that you’ve resolved these initial issues, you rerun data.info() to verify that you’ve fixed all of your initial concerns:

As you can see, the original twenty-seven entries now all have data in them. The release_date column has a datetime64 format, and the three earnings and film_length columns all have numeric types. There’s even a new release_year column in your DataFrame as well. Of course, this check wasn’t really necessary because, like all good secret agents, you already checked your code as you wrote it.

You may also have noticed that the column order has changed. This has happened as a result of your earlier use of combine_first() . In this analysis, the column order doesn’t matter because you never need to display the DataFrame. If necessary, you can specify a column order by using square brackets, as in data[["column_1", ...]] .

At this point, you’ve made sure nothing is missing from your data and that it’s all of the correct type. Next you turn your attention to the actual data itself.

While updating the movie_budget column label earlier, you may have noticed that its numbers appear small compared to the other financial columns. The reason is that its data is in thousands, whereas the other columns are actual figures. You decide to do something about this because it could cause problems if you compared this data with the other financial columns that you worked on.

You might be tempted to write another lambda and pass it into .assign() using the movie_budget parameter. Unfortunately, this won’t work because you can’t use the same parameter twice in the same function. You could revisit the movie_budget parameter and add functionality to multiply its result by 1000 , or you could create yet another column based on the movie_budget column values. Alternatively, you could create a separate .assign() call.

Each of these options would work, but multiplying the existing values is probably the simplest. Go ahead and see if you can multiply the results of your earlier movie_budget lambda by 1000 .

Adjusting Quantities

The code below is similar to the earlier version, but the lambda’s result is multiplied by 1000:
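Sketched as a standalone step, the adjusted movie_budget lambda might look like this. It replaces the earlier movie_budget step in the pipeline rather than running after it:

```python
data = data.assign(
    movie_budget=lambda df: (
        # Strip the symbols, convert to a float, then scale from thousands to actual dollars.
        df["movie_budget"].replace("[$,]", "", regex=True).astype("Float64") * 1000
    ),
)
```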

The movie_budget lambda fixes the currency values, and you’ve adjusted it to multiply those values by 1000 . All financial columns are now in the same units, making comparisons possible.

You can use the techniques that you used earlier to view the values in movie_budget and confirm that you’ve correctly adjusted them.

Now that you’ve sorted out some formatting issues, it’s time for you to move on and do some other checks.

One of the most difficult data cleansing tasks is checking for typos because they can appear anywhere. As a consequence, you’ll often not encounter them until late in your analysis and, indeed, may never notice them at all.

In this exercise, you’ll look for typos in the names of the actors who played Bond and in the car manufacturers’ names. This is relatively straightforward to do because both of these columns contain data items from a finite set of allowable values:

The .value_counts() method allows you to quickly obtain a count of each element within a pandas Series. Here you use it to help you find possible typos in the bond_actor column. As you can see, one instance of Sean Connery and one of Roger Moore contain typos.

To fix these with string replacement, you use the .str.replace() method of a pandas data Series. In its simplest form, you only need to pass it the original string and the string that you want to replace it with. In this case, you can replace both typos at the same time by chaining two calls to .str.replace() .

Using the above information, go ahead and see if you can correct the typos in the bond_actor column. As before, you can achieve this with a lambda.

Fixing The Actors' Names

In the updated code, you’ve fixed the actors’ names:

As you can see, a new lambda updates both typos in the bond_actor column. The first .str.replace() changes all instances of Shawn to Sean , while the second one fixes the MOORE instances.

You can test that these changes have been made by rerunning the .value_counts() method once more.

As an exercise, why don’t you analyze the car manufacturer’s names and see if you can spot any typos? If there are any, use the techniques shown above to fix them.

Checking The Car Names For Typos

Once again, you use value_counts() to analyze the car_manufacturer column:
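As a short sketch, the check is the same call you used for the actors, pointed at the car_manufacturer column:

```python
data["car_manufacturer"].value_counts()
```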

This time, there are two rogue entries for a car named Astin Martin . These are incorrect and need fixing:
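A standalone sketch of the fix, using the same .str.replace() technique as before:

```python
data = data.assign(
    car_manufacturer=lambda df: df["car_manufacturer"].str.replace("Astin", "Aston"),
)
```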

To fix the typo, you use the same techniques as earlier, only this time you replace "Astin" with "Aston" in the car_manufacturer column.

Before you go any further, you should, of course, rerun the .value_counts() method against your data to validate your updates.

With the typos fixed, next you’ll see if you can find any suspicious-looking data.

The next check that you’ll perform is verifying that the numerical data is in the correct range. This again requires careful thought, because any large or small data point could be a genuine outlier, so you may need to recheck your source. But some may indeed be incorrect.

In this example, you’ll investigate the martinis that Bond consumed in each movie, as well as the length of each movie, to make sure their values are within a sensible range. There are several ways that you could analyze numerical data to check for outliers. A quick way is to use the .describe() method:

When you use .describe() on either a pandas Series or a DataFrame, it gives you a set of statistical measures relating to the Series or DataFrame’s numerical values. As you can see, .describe() has given you a range of statistical data relating to each of the two columns of the DataFrame that you called it on. These also reveal some probable errors.

Looking at the film_length column, the quartile figures reveal that most movies are around 130 minutes long, yet the mean is almost 170 minutes. The mean has been skewed by the maximum, which is a whopping 1200 minutes.

Depending on the nature of the analysis, you’d probably want to recheck your source to find out if this maximum value is indeed incorrect. In this scenario, having a movie lasting twenty hours clearly indicates a typo. After verifying your original dataset , you find 120 to be the correct value.

Turning next to the number of martinis that Bond drank during each movie, the minimum figure of -6 simply doesn’t make sense. As before, you recheck the source and find that this should be 6.

You can fix both of these errors using the .replace() method introduced earlier. For example data["martinis_consumed"].replace(-6, 6) will update the martini figures, and you can use a similar technique for the film duration. As before, you can do both using lambdas within .assign() , so why not give it a try?

You can reveal the updated cleansing code, including these latest additions, below:

Fixing Invalid Outliers

Now you’ve added in the two additional lambdas:

Earlier, you used a lambda to remove a "mins" string from the film_length column entries. You can’t, therefore, create a separate lambda within the same .assign() to replace the incorrect film length because doing so would mean passing in a second parameter into .assign() with the same name as this. This, of course, is illegal.

However, there’s an alternative solution that requires some lateral thinking. You could’ve created a separate .assign() method, but it’s probably more readable to keep all changes to the same column in the same .assign() .

To perform the replacement, you adjusted the existing film_length lambda to replace the invalid 1200 with 120 . You fixed the martinis column with a new lambda that replaces -6 with 6 .

As ever, you should test these updates once more using the describe() method. You should now see sensible values for the maximum film_length and the minimum martinis_consumed columns.

Your data is almost cleansed. There is just one more thing to check and fix, and that’s the possibility that drinking too many vodka martinis has left you seeing double.

The final issue that you’ll check for is whether any of the rows of data have been duplicated. It’s usually good practice to leave this step until last because it’s possible that your earlier changes could cause duplicate data to occur. This most commonly happens when you fix strings within data because often it’s different variants of the same string that cause unwanted duplicates to occur in the first place.

The easiest way to detect duplicates is to use the DataFrame’s .duplicated() method:
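As a short sketch, combining .duplicated() with .loc[] reveals the offending rows:

```python
# keep=False marks every copy of a duplicated row as True.
data.loc[data.duplicated(keep=False)]
```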

By setting keep=False , the .duplicated() method will return a Boolean Series with duplicate rows marked as True . As you saw earlier, when you pass this Boolean Series into data.loc[] , the duplicate DataFrame rows are revealed to you. In your data, two rows have been duplicated. So your next step is to get rid of one instance of each row.

To get rid of duplicate rows, you call the .drop_duplicates() method on the data DataFrame that you’re building up. As its name suggests, this method will look through the DataFrame and remove any duplicate rows that it finds, leaving only one. To reindex the DataFrame sequentially, you set ignore_index=True .

See if you can figure out where to insert .drop_duplicates() in your code. You don’t use a lambda, but duplicates are removed after the call to .assign() has finished. Test your effort to make sure that you’ve indeed removed the duplicates.

Removing Duplicates

In the updated code, you’ve dropped the duplicate row:

As you can see, you’ve placed .drop_duplicates() on line 49, after the .assign() method has finished adjusting and creating columns.

If you rerun data.loc[data.duplicated(keep=False)] , it won’t return any rows. Each row is now unique.

You’ve now successfully identified several flaws with your data and used various techniques to cleanse them. Keep in mind that if your analysis highlights new flaws, then you may need to revisit the cleansing phase once more. On this occasion, this isn’t necessary.

With your data suitably cleansed, you might be tempted to jump in and start your analysis. But just before you start, don’t forget that other very important task that you still have left to do!

As part of your training, you’ve learned that you should save your cleansed DataFrame to a fresh file. Other analysts can then use this to save the trouble of having to recleanse the same issues once more, but you’re also allowing access to the original file in case they need it for reference. The .to_csv() method allows you to perform this good practice:
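The call itself is a one-liner, using the output filename mentioned below:

```python
data.to_csv("james_bond_data_cleansed.csv", index=False)
```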

You write your cleansed DataFrame out to a CSV file named james_bond_data_cleansed.csv . By setting index=False , you’re not writing the index, only the pure data. This file will be useful to future analysts.

Note: Earlier you saw how you can source data from a variety of different file types, including Excel spreadsheets and JSON. You won’t be surprised to learn that pandas also allows you to write DataFrame content back out to these files. To do this, you use an appropriate method like .to_excel() or .to_parquet() . The input/output section of the pandas documentation contains the details.

Parquet is a great format for storing your intermediate files because the files are compressed and the format supports the different data types you’re working with. Its biggest disadvantage is that Parquet isn’t supported by all tools.

Before moving on, take a moment to reflect on what you’ve achieved up to this point. You’ve cleansed your data such that it’s now structurally sound with nothing missing, no duplicates, and no invalid data types or outliers. You’ve also removed spelling errors and inconsistencies between similar data values.

Your great effort so far not only allows you to analyze your data with confidence, but by highlighting these issues, it may be possible for you to revisit the data source and fix those issues there as well. Indeed, you can perhaps prevent similar issues from reappearing in future if you’ve highlighted a flaw in the processes for acquiring the original data.

Data cleansing really is worth putting time and effort into, and you’ve reached an important milestone. Now that you’ve tidied up and stored your data, it’s time to move on to the main part of your mission. It’s time to start meeting your objectives.

Performing Data Analysis Using Python

Data analysis is a huge topic and requires extensive study to master. However, there are four major types of analysis:

Descriptive analysis uses previous data to explain what’s happened in the past . Common examples include identifying sales trends or your customers’ behaviors.

Diagnostic analysis takes things a stage further and tries to find out why those events have happened . For example, why did the sales trend occur? And why exactly did your customers do what they did?

Predictive analysis builds on the previous analysis and uses techniques to try and predict what might happen in the future . For example, what do you expect future sales trends to do? Or what do you expect your customers to do next?

Prescriptive analysis takes everything discovered by the earlier analysis types and uses that information to formulate a future strategy . For example, you might want to implement measures to prevent sales trend predictions from falling or to prevent your customers from purchasing elsewhere.

In this tutorial, you’ll use Python to perform some descriptive analysis techniques on your james_bond_data_cleansed.csv data file to answer the questions that your boss asked earlier. It’s time to dive in and see what you can find.

The purpose of the analysis stage in the workflow diagram that you saw at the start of this tutorial is for you to process your cleansed data and extract insights and relationships from it that are of use to other interested parties. Although it’s probably your conclusions that others will be interested in, if you’re ever challenged on how you arrived at them, you have the source data to support your claims.

To complete the remainder of this tutorial, you’ll need to install both the matplotlib and scikit-learn libraries. You can do this by using python -m pip install matplotlib scikit-learn , but don’t forget to prefix it with ! if you’re using it from within a Jupyter Notebook.

During your analysis, you’ll be drawing some plots of your data. To do this, you’ll use the plotting capabilities of the Matplotlib library.

In addition, you’ll be performing a regression analysis , so you’ll need to use some tools from the scikit-learn library.

Your data contains reviews from both Rotten Tomatoes and IMDb . Your first objective is to find out if there’s a relationship between the Rotten Tomatoes ratings and those from IMDb. To do this, you’ll use a regression analysis to see if the two rating sets are related.

When performing a regression analysis, a good first step is to draw a scatterplot of the two sets of data that you’re analyzing. The shape of this plot gives you a quick visual clue as to the presence of any relationship between them, and if so, whether it’s linear , quadratic or exponential .

The code below sets you up to eventually produce a scatterplot of both ratings sets:
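A sketch of that setup, based on the description in the next paragraph:

```python
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("james_bond_data_cleansed.csv")
```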

To begin with, you import the pandas library to allow you to read your shiny new james_bond_data_cleansed.csv into a DataFrame. You also import the matplotlib.pyplot library, which you’ll use to create the actual scatterplot.

You then use the following code to actually create the scatterplot:
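A sketch of the plotting code follows. The title and axis labels are illustrative wording, not taken from the original listing:

```python
fig, ax = plt.subplots()
ax.scatter(data["imdb"], data["rotten_tomatoes"])
ax.set_title("Scatter Plot of Ratings")
ax.set_xlabel("Average IMDb Rating")
ax.set_ylabel("Average Rotten Tomatoes Rating")
fig.show()
```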

Calling the subplots() function sets up an infrastructure that allows you to add one or more plots into the same figure. This won’t concern you here because you’ll only have one, but its capabilities are worth investigating.

To create the initial scatterplot, you specify the horizontal Series as the imdb column of your data and the vertical Series as the rotten_tomatoes column. The order is arbitrary here because it’s the relationship between them that interests you.

To help readers understand your plot, you next give it a title and then provide sensible labels for both axes. The fig.show() call is optional in a Jupyter Notebook but may be needed to display your plot in other environments.

In Jupyter Notebooks, your plot should look like this:

scatterplot of both sets of rating data

The scatterplot shows a distinct slope upwards from left to right. This means that as one set of ratings increases, the other set does as well. To dig deeper and find a mathematical relationship that will allow you to estimate one set based on the other, you need to perform a regression analysis. This means that you need to expand your previous code as follows:
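A sketch of the expanded setup, matching the description in the next paragraph:

```python
from sklearn.linear_model import LinearRegression

x = data.loc[:, ["imdb"]]           # a DataFrame holding the single feature column
y = data.loc[:, "rotten_tomatoes"]  # a Series holding the values to predict
```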

First of all, you import LinearRegression . As you’ll see shortly, you’ll need this to perform the linear regression calculation. You then create a pandas DataFrame and a pandas Series. Your x is a DataFrame that contains the imdb column’s data, while y is a Series that contains the rotten_tomatoes column’s data. You could potentially regress on several features, which is why x is defined as a DataFrame with a list of columns.

You now have everything you need to perform the linear regression calculations:
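A sketch of those calculations is shown below. The exact string formatting for r_squared and best_fit is illustrative:

```python
model = LinearRegression()
model.fit(x, y)  # ordinary least squares fit

r_squared = f"R-Squared: {model.score(x, y):.2f}"
best_fit = f"y = {model.coef_[0]:.4f}x {model.intercept_:+.4f}"
y_pred = model.predict(x)  # predicted Rotten Tomatoes ratings for each IMDb rating
```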

First of all, you create a LinearRegression instance and pass in both data sets to it using .fit() . This will perform the actual calculations for you. By default, it uses ordinary least squares (OLS) to do so.

Once you’ve created and populated the LinearRegression instance, its .score() method calculates the R-squared, or coefficient of determination, value. This measures how close the best-fit line is to the actual values. In your analysis, the R-squared value of 0.79 indicates a 79 percent accuracy between the best-fit line and the actual values. You convert it to a string named r_squared for plotting later. You round the value for neatness.

To construct a string of the equation of the best-fit straight line, you use your LinearRegression object’s .coef_ attribute to get its gradient, and its .intercept_ attribute to find the y -intercept. The equation is stored in a variable named best_fit so that you can plot it later.

Note: You may be wondering why both the model.coef_ and model.intercept_ variables have underscore suffixes. This is a scikit-learn convention to indicate variables that contain estimated values.

To get the various y coordinates that the model predicts for each given value of x , you use your model’s .predict() method and pass it the x values. You store these values in a variable named y_pred , again to allow you to plot the line later.

Finally, you produce your scatterplot:
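A sketch of the final plot follows. The text coordinates, title, and labels are illustrative choices rather than values from the original listing:

```python
fig, ax = plt.subplots()
ax.scatter(data["imdb"], data["rotten_tomatoes"])
ax.plot(data["imdb"], y_pred, color="red")       # best-fit line in red
ax.text(7.25, 5.5, r_squared + "\n" + best_fit)  # illustrative text coordinates
ax.set_title("Scatter Plot of Ratings")
ax.set_xlabel("Average IMDb Rating")
ax.set_ylabel("Average Rotten Tomatoes Rating")
fig.show()
```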

The new lines add the best-fit information onto the scatterplot. The .text() method places the r_squared and best_fit strings at the coordinates passed to it, while the .plot() method adds the best-fit line, in red, to the scatterplot. As before, fig.show() isn’t needed in a Jupyter Notebook.

The Jupyter Notebook result of all of this is shown below:

screenshot of a scatterplot with linear regression line

Now that you’ve completed your regression analysis, you can use its equation to predict one rating from the other with approximately 79 percent accuracy.

Your data includes information on the running times of each of the different Bond movies. Your second objective asks you to find out if there are any insights to glean from analyzing the lengths of the movies. To do this, you’ll create a bar plot of movie timings and see if it reveals anything interesting:
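A sketch of the bar plot follows. The .sort_index() call is an extra step, assumed here to keep the seven bins in ascending order, and the title and labels are illustrative:

```python
length = data["film_length"].value_counts(bins=7).sort_index()

length.plot.bar(
    title="Film Length Distribution",
    xlabel="Film Length (mins)",
    ylabel="Count",
)
```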

This time, you create a bar plot using the plotting capabilities of pandas. While these aren’t as extensive as Matplotlib’s, they do use some of Matplotlib’s underlying functionality. You create a Series consisting of the data from the film_length column of your data. You then use .value_counts() to create a Series containing the count of each movie length. Finally, you group them into seven ranges by passing in bins=7 .

Once you’ve created the Series, you can quickly plot it using .plot.bar() . This allows you to define a title and axis labels for your plot as shown. The resulting plot reveals a very common statistical distribution:

screenshot showing a normal distribution of movie lengths

As you can see from the plot, the movie lengths resemble a normal distribution . The mean movie time sits between 122 minutes and 130 minutes, a little over two hours.

Note that neither the fig, ax = plt.subplots() nor fig.show() code is necessary in a Jupyter Notebook. Some environments may need it to allow them to display the plot.

You can find more specific statistical values if you wish:
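For example, passing a list of function names into .agg() returns each statistic in one call:

```python
data["film_length"].agg(["mean", "std", "min", "max"])
```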

Each pandas data Series has a useful .agg() method that allows you to pass in a list of functions. Each of these is then applied to the data in the Series. As you can see, the mean is indeed in the 122 to 130 minutes range. The standard deviation is small, meaning there isn’t much spread in the range of movie times. The minimum and maximum are 106 minutes and 163 minutes, respectively.

In this final analysis, you’ve been asked to investigate whether or not there’s any relationship between a movie’s user rating and the number of kills that Bond achieves in it.

You decide to proceed along similar lines as you did when you analyzed the relationship between the two different ratings sets. You start with a scatterplot:
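A sketch of that scatterplot is below. The kills column is assumed here to be named bond_kills, and the title and labels are illustrative:

```python
fig, ax = plt.subplots()
ax.scatter(data["imdb"], data["bond_kills"])  # "bond_kills" is an assumed column name
ax.set_title("Scatter Plot of Kills vs Ratings")
ax.set_xlabel("Average IMDb Rating")
ax.set_ylabel("Kills by Bond")
fig.show()
```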

The code is virtually identical to what you used in your earlier scatterplot. You decided to use the IMDb data in the analysis, but you could’ve used the Rotten Tomatoes data instead. You’ve already established that there’s a close relationship between the two, so it doesn’t matter which you choose.

This time, when you draw the scatterplot, it looks like this:

screenshot showing scatterplot of kills vs movie ratings

As you can see, the scatterplot shows you that the data is randomly distributed. This indicates that there’s no relationship between movie rating and the number of Bond kills. Whether the victim wound up on the wrong side of a Walther PPK , got sucked out of a plane, or was left to drift off into space, Bond movie fans don’t seem to care much about the number of bad guys that Bond eliminates.

When analyzing data, it’s important to realize that you may not always find something useful. Indeed, one of the pitfalls that you must avoid when performing data analysis is introducing your own bias into your data before analyzing it, and then using it to justify your preconceived conclusions. Sometimes there’s simply nothing to conclude.

At this point, you’re happy with your findings. It’s time for you to communicate them back to your bosses.

Once your data modeling is complete and you’ve obtained useful information from it, the next stage is to communicate your findings to other interested parties. After all, they’re not For Your Eyes Only . You could do this using a report or presentation. You’ll likely discuss your data sources and analysis methodology before stating your conclusions. Having the data and methodology behind your conclusions gives them authority.

You may find that once you’ve presented your findings, questions will come up that require future analysis. Once more, you may need to set additional objectives and work through the entire workflow process to resolve these new points. Look back at the diagram, and you’ll see that there’s a possible cyclic, as well as iterative, nature to a data analysis workflow.

In some cases, you may reuse your analysis methods. If so, you may consider writing some scripts that read future versions of the data, cleanse it, and analyze it in the same way that you just have. This will allow future results to be compared to yours and will add scalability to your efforts. By repeating your analysis in the future, you can monitor your original findings to see how well they stand up in the face of future data.

Alternatively, you may discover a flaw in your methodology and need to reanalyze your data differently. Again, the workflow diagram notes this possibility as well.

As you analyzed the dataset, you may have noticed that one of the James Bond Movies is missing. Take a look back and see if you can figure out which one it is. You can reveal the answer below, but no peeking! Also, if you run data["bond_actor"].value_counts() you may be surprised to find that Sean Connery played Bond only six times to Roger Moore’s seven. Or did he?

Bond Is Back!

The dataset that you’re using in this tutorial doesn’t include Never Say Never Again . This movie wasn’t considered an official part of the James Bond franchise. However, it did star Sean Connery in the title role. So technically, both Connery and Moore have played Bond 007 times each.

That’s it, your mission is complete. M is very pleased. As a reward, he instructs Q to give you a pen that turns into a helicopter. Always a handy tool to have for tracking down future data for analysis.

You’ve now gained experience in using a data analysis workflow to analyze some data and draw conclusions from your findings. You understand the main stages in a data analysis workflow and the reasons for following them. As you learn more advanced analysis techniques in the future, you can still use the key skills that you learned here to make sure your future data analysis projects progress thoroughly and efficiently.

In this tutorial, you’ve learned:

  • The importance of a data analysis workflow
  • The purpose of the main stages in a data analysis workflow
  • Common techniques for cleansing data
  • How to use some common data analysis methods to meet objectives
  • How to display the results of a data analysis graphically.

You should consider learning more data analysis techniques and practicing your skills using them. If you’ve done any further analysis on the James Bond data used here, then feel free to share your interesting findings in the comments section below. In fact, try finding something to share with us that’s shocking. Positively shocking.


Thinking Neuron

Python Case Studies

Machine learning business case studies solved using Python. These are examples of how you can solve similar use cases for your own projects and deploy the models into production.

I have discussed the following points in each of the case studies:

  • How to explore the given data?
  • How to perform data pre-processing (missing values, outliers, transformations, etc.)
  • How to create new columns based on existing columns (Feature Engineering)?
  • How to select the best columns for machine learning (Feature Selection)?
  • How to find the best ML algorithm for the given data?
  • How to tune the predictive models.
  • How to deploy predictive models into production?
  • What happens after the model deployment?

Time Series Use Cases

  • Time Series Forecasting: Forecasting monthly sales quantity for the Superstore dataset

NLP Use Cases

  • TF-IDF Text Classification: Support ticket classification using TF-IDF vectorization
  • Sentiment Analysis using BERT: Finding the sentiment of Indigo flight passengers based on their tweets
  • Transfer Learning using GloVe: Microsoft ticket classification using GloVe
  • Text Classification using Word2Vec: How to create classification models on text data using Word2Vec encodings

Regression Use Cases

  • Zomato Restaurant Rating: How to predict the future rating of a restaurant with an ML model. A case study in Python.
  • Predicting Diamond Prices: Creating an ML model to predict the appropriate price of a given diamond.
  • Evaluating Old Car Prices: Predicting the right price for a used car using Python machine learning.
  • Bike Rental Demand Prediction: Creating an ML model to forecast the demand for rental bikes every hour of the day.
  • Computer Price Prediction: Estimating the price of a computer based on its specs.
  • Concrete Strength Prediction: How strong will this concrete be? Predicting the strength of concrete based on its mixture details.
  • Boston Housing Price Prediction: House price prediction case study on the famous Boston dataset.

Classification Use Cases

  • Loan Classification: A predictive model to approve or reject a new loan application.
  • German Credit Risk: Classifying a loan as a potential risk or safe for the bank.
  • Salary Band Classification: Identifying whether an individual's salary should be more than $50,000 or not.
  • Titanic Survival: A case study to see what types of passengers survived the Titanic crash.
  • Advertisement Click Prediction: A case study to predict whether a user will click on advertisements.

Deep Learning Use Cases

  • ANN Regression: Creating an artificial neural network model for regression
  • ANN Classification: Creating an artificial neural network model for classification
  • LSTM: Predicting the Infosys stock price using a Long Short-Term Memory network
  • CNN: Creating a face recognition model using a Convolutional Neural Network


Python Case Studies

Introduction

This project is a collection of six captivating case studies that use Python and computational techniques to analyze data, build classification models, and unravel insights from multifaceted datasets. The array of topics touches on different domains of knowledge.

Central to these studies is the application of tools and concepts, ranging from fundamental Python objects like DataFrames and NumPy arrays to advanced tools like scikit-learn. The case studies deal with different topics: working with bird migration data to untangle the mysteries of flight, deciphering social network dynamics among villagers, exploring diverse whisky flavors, and analyzing book data.

This project has led me to acquire new technical skills, develop my analytical thinking, and solidify my problem-solving skills while using Python’s potential to extract knowledge from diverse datasets.

Here is the table of contents:

  • Exploring DNA Sequencing and Analysis
  • Linguistic Analysis of Books: A Study of Language Variability
  • Predictive Insights through Classification Analysis
  • Analysis of Scotch Whisky Production and Flavor Profiles
  • Analyzing Bird Migration Patterns with GPS Data
  • Deep Dive into Network Analysis and Social Relationships
  • Bibliography

thecleverprogrammer

Data Science Case Studies: Solved using Python

Aman Kharwal

  • February 19, 2021
  • Machine Learning

Solving a Data Science case study means analyzing and solving a problem statement intensively. Solving case studies will help you show unique and amazing data science use cases in your portfolio. In this article, I’m going to introduce you to 3 data science case studies solved and explained using Python.

Data Science Case Studies

If you’ve learned data science by taking a course or certification program, you’re still not that close to finding a job easily. The most important point of your Data Science interview is to show how you can use your skills in real use cases. Below are 3 data science case studies that will help you understand how to analyze and solve a problem. All of the data science case studies mentioned below are solved and explained using Python.

Case Study 1:  Text Emotions Detection

If you have an interest in natural language processing, then this use case is for you. The idea is to train a machine learning model to generate emojis based on an input text. This machine learning model can then be used in training AI chatbots.

Use Case: A human can express emotions in many forms, such as facial expressions, gestures, speech, and text. The detection of text emotions is a content-based classification problem. Detecting a person’s emotions is a difficult task, and detecting them from text written by a person is even more difficult because people can express their emotions in so many forms.

Recognizing this type of emotion from text written by a person plays an important role in applications such as chatbots, customer support forums, and customer reviews. So you have to train a machine learning model that can identify the emotion of a text by presenting the most relevant emoji for the input text.

data science case studies

Case Study 2:  Hotel Recommendation System

A hotel recommendation system typically works on collaborative filtering, which makes recommendations based on ratings given by other customers in the same category as the user looking for a product.

Use Case: We all plan trips, and the first thing to do when planning a trip is finding a hotel. There are so many websites recommending the best hotel for our trip. A hotel recommendation system aims to predict which hotel a user is most likely to choose from among all hotels, so your task is to build a system that helps the user book the best hotel. You can do this using customer reviews.

For example, suppose you want to go on a business trip. The hotel recommendation system should then show you the hotels that other customers have rated best for business travel. The approach, therefore, is to build a recommendation system based on customer reviews and ratings: use the ratings and reviews given by customers who belong to the same category as the user to build the hotel recommendation system.

use cases

Case Study 3:  Customer Personality Analysis

Customer analysis is one of the most important tasks for a data scientist working at a product-based company. So if you want to join a product-based company, then this data science case study is best for you.

Use Case:   Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviours and concerns of different types of customers.

You have to perform an analysis that helps a business modify its product based on its target customers from different customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that particular segment.

case studies

These three data science case studies are based on real-world problems. The first, Text Emotions Detection, is based entirely on natural language processing, and the machine learning model you train can be used in training an AI chatbot. The second, the Hotel Recommendation System, is also based on NLP, but here you will understand how to generate recommendations using collaborative filtering. The last, Customer Personality Analysis, is for someone who wants to focus on the analysis part.

All these data science case studies are solved using Python, here are the resources where you will find these use cases solved and explained:

  • Text Emotions Detection
  • Hotel Recommendation System
  • Customer Personality Analysis

I hope you liked this article on data science case studies solved and explained using the Python programming language. Feel free to ask your valuable questions in the comments section below.


Essential Statistics for Data Science: A Case Study using Python, Part I

Get to know some of the essential statistics you should be very familiar with when learning data science


Our last post dove straight into linear regression. In this post, we'll take a step back to cover essential statistics that every data scientist should know. To demonstrate these essentials, we'll look at a hypothetical case study involving an administrator tasked with improving school performance in Tennessee.

You should already know:

  • Python fundamentals — learn on dataquest.io

Note, this tutorial is intended to serve solely as an educational tool and not as a scientific explanation of the causes of various school outcomes in Tennessee .

Article Resources

  • Notebook and Data: Github
  • Libraries: pandas, matplotlib, seaborn

Introduction

Meet Sally, a public school administrator. Some schools in her state of Tennessee are performing below average academically. Her superintendent, under pressure from frustrated parents and voters, approached Sally with the task of understanding why these schools are under-performing. Not an easy problem, to be sure.

To improve school performance, Sally needs to learn more about these schools and their students, just as a business needs to understand its own strengths and weaknesses and its customers.

Though Sally is eager to build an impressive explanatory model, she knows the importance of conducting preliminary research to prevent possible pitfalls or blind spots (e.g., cognitive biases). Thus, she engages in a thorough exploratory analysis, which includes a lit review, data collection, descriptive and inferential statistics, and data visualization.

Sally has strong opinions as to why some schools are under-performing, but opinions won't do, nor will a handful of facts; she needs rigorous statistical evidence.

Sally conducts a lit review, which involves reading a variety of credible sources to familiarize herself with the topic. Most importantly, Sally keeps an open mind and embraces a scientific world view to help her resist confirmation bias (seeking solely to confirm one's own world view).

In Sally's lit review, she finds multiple compelling explanations of school performance: curricula, income, and parental involvement. These sources will help Sally select her model and data, and will guide her interpretation of the results.

Data Collection

The data we want isn't always available, but Sally lucks out and finds student performance data based on test scores ( school_rating ) for every public school in middle Tennessee. The data also includes various demographic, school faculty, and income variables (see readme for more information). Satisfied with this dataset, she writes a web-scraper to retrieve the data.

But data alone can't help Sally; she needs to convert the data into useful information.

Descriptive and Inferential Statistics

Sally opens her stats textbook and finds that there are two major types of statistics, descriptive and inferential.

Descriptive statistics identify patterns in the data, but they don't allow for making hypotheses about the data.

Within descriptive statistics, there are two measures used to describe the data: central tendency and deviation . Central tendency refers to the central position of the data (mean, median, mode) while the deviation describes how far spread out the data are from the mean. Deviation is most commonly measured with the standard deviation. A small standard deviation indicates the data are close to the mean, while a large standard deviation indicates that the data are more spread out from the mean.

Inferential statistics allow us to make hypotheses (or inferences ) about a sample that can be applied to the population. For Sally, this involves developing a hypothesis about her sample of middle Tennessee schools and applying it to her population of all schools in Tennessee.

For now, Sally puts aside inferential statistics and digs into descriptive statistics.

To begin learning about the sample, Sally uses pandas' describe method, as seen below. The column headers in bold text represent the variables Sally will be exploring. Each row header represents a descriptive statistic about the corresponding column.
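A sketch of that call is below. It assumes the scraped dataset was loaded into a DataFrame named df; the filename is an assumption, so substitute the name of the actual CSV file:

```python
import pandas as pd

# The dataset filename is assumed; substitute the name of the scraped CSV file.
df = pd.read_csv("middle_tn_schools.csv")
df.describe()
```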

|       | school_rating | size | reduced_lunch | state_percentile_16 | state_percentile_15 | stu_teach_ratio | avg_score_15 | avg_score_16 | full_time_teachers | percent_black | percent_white | percent_asian | percent_hispanic |
|-------|---------------|------|---------------|---------------------|---------------------|-----------------|--------------|--------------|--------------------|---------------|---------------|---------------|------------------|
| count | 347.000000 | 347.000000 | 347.000000 | 347.000000 | 341.000000 | 347.000000 | 341.000000 | 347.000000 | 347.000000 | 347.000000 | 347.000000 | 347.000000 | 347.000000 |
| mean | 2.968300 | 699.472622 | 50.279539 | 58.801729 | 58.249267 | 15.461671 | 57.004692 | 57.049856 | 44.939481 | 21.197983 | 61.673487 | 2.642651 | 11.164553 |
| std | 1.690377 | 400.598636 | 25.480236 | 32.540747 | 32.702630 | 5.725170 | 26.696450 | 27.968974 | 22.053386 | 23.562538 | 27.274859 | 3.109629 | 12.030608 |
| min | 0.000000 | 53.000000 | 2.000000 | 0.200000 | 0.600000 | 4.700000 | 1.500000 | 0.100000 | 2.000000 | 0.000000 | 1.100000 | 0.000000 | 0.000000 |
| 25% | 2.000000 | 420.500000 | 30.000000 | 30.950000 | 27.100000 | 13.700000 | 37.600000 | 37.000000 | 30.000000 | 3.600000 | 40.600000 | 0.750000 | 3.800000 |
| 50% | 3.000000 | 595.000000 | 51.000000 | 66.400000 | 65.800000 | 15.000000 | 61.800000 | 60.700000 | 40.000000 | 13.500000 | 68.700000 | 1.600000 | 6.400000 |
| 75% | 4.000000 | 851.000000 | 71.500000 | 88.000000 | 88.600000 | 16.700000 | 79.600000 | 80.250000 | 54.000000 | 28.350000 | 85.950000 | 3.100000 | 13.800000 |
| max | 5.000000 | 2314.000000 | 98.000000 | 99.800000 | 99.800000 | 111.000000 | 99.000000 | 98.900000 | 140.000000 | 97.400000 | 99.700000 | 21.100000 | 65.200000 |

Looking at the output above, Sally's variables can be put into two classes: measurements and indicators.

Measurements are variables that can be quantified. All data in the output above are measurements. Some of these measurements, such as state_percentile_16 , avg_score_16 and school_rating , are outcomes; these outcomes cannot be used to explain one another. For example, explaining school_rating as a result of state_percentile_16 (test scores) is circular logic. Therefore we need a second class of variables.

The second class, indicators, are used to explain our outcomes. Sally chooses indicators that describe the student body (for example, reduced_lunch ) or school administration ( stu_teach_ratio ) hoping they will explain school_rating .

Sally sees a pattern in one of the indicators, reduced_lunch . reduced_lunch is a variable measuring the average percentage of students per school enrolled in a federal program that provides lunches for students from lower-income households. In short, reduced_lunch is a good proxy for household income, which Sally remembers from her lit review was correlated with school performance.

Sally isolates reduced_lunch and groups the data by school_rating using pandas' groupby method and then uses describe on the re-shaped data (see below).
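A sketch of that call, again assuming the DataFrame is named df:

```python
df[["school_rating", "reduced_lunch"]].groupby("school_rating").describe()
```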

reduced_lunch statistics grouped by school_rating:

| school_rating | count | mean | std | min | 25% | 50% | 75% | max |
|---------------|-------|------|-----|-----|-----|-----|-----|-----|
| 0.0 | 43.0 | 83.581395 | 8.813498 | 53.0 | 79.50 | 86.0 | 90.00 | 98.0 |
| 1.0 | 40.0 | 74.950000 | 11.644191 | 53.0 | 65.00 | 74.5 | 84.25 | 98.0 |
| 2.0 | 44.0 | 64.272727 | 11.956051 | 37.0 | 54.75 | 62.5 | 74.00 | 88.0 |
| 3.0 | 56.0 | 50.285714 | 13.550866 | 24.0 | 41.00 | 48.5 | 63.00 | 78.0 |
| 4.0 | 86.0 | 41.000000 | 16.681092 | 4.0 | 30.00 | 41.5 | 50.00 | 87.0 |
| 5.0 | 78.0 | 21.602564 | 17.651268 | 2.0 | 8.00 | 19.0 | 29.75 | 87.0 |

Below is a discussion of the metrics from the table above and what each result indicates about the relationship between school_rating and reduced_lunch :

count : the number of schools at each rating. Most of the schools in Sally's sample have a 4- or 5-star rating, but 25% of schools have a 1-star rating or below. This confirms that poor school performance isn't merely anecdotal, but a serious problem that deserves attention.

mean : the average percentage of students on reduced_lunch among all schools by each school_rating . As school performance increases, the average number of students on reduced lunch decreases. Schools with a 0-star rating have 83.6% of students on reduced lunch. And on the other end of the spectrum, 5-star schools on average have 21.6% of students on reduced lunch. We'll examine this pattern further in the graphing section.

std : the standard deviation of the variable. Referring to the school_rating of 0, a standard deviation of 8.813498 indicates that 68.2% (refer to readme ) of all observations are within 8.81 percentage points on either side of the average, 83.6%. Note that the standard deviation increases as school_rating increases, indicating that reduced_lunch loses explanatory power as school performance improves. As with the mean, we'll explore this idea further in the graphing section.

min : the minimum value of the variable. This represents the school with the lowest percentage of students on reduced lunch at each school rating. For 0- and 1-star schools, the minimum percentage of students on reduced lunch is 53%. The minimum for 5-star schools is 2%. The minimum value tells a similar story as the mean, but looking at it from the low end of the range of observations.

25% : the bottom quartile; represents the lowest 25% of values for the variable, reduced_lunch . For 0-star schools, 25% of the observations are less than 79.5%. Sally sees the same trend in the bottom quartile as the above metrics: as school_rating increases the bottom 25% of reduced_lunch decreases.

50% : the second quartile; represents the lowest 50% of values. Looking at the trend in school_rating and reduced_lunch , the same relationship is present here.

75% : the top quartile; represents the lowest 75% of values. The trend continues.

max : the maximum value for that variable. You guessed it: the trend continues!

The descriptive statistics consistently reveal that schools with more students on reduced lunch under-perform when compared to their peers. Sally is on to something.

Sally decides to look at reduced_lunch from another angle using a correlation matrix with pandas' corr method. The values in the correlation matrix table will be between -1 and 1 (see below). A value of -1 indicates the strongest possible negative correlation, meaning as one variable decreases the other increases. And a value of 1 indicates the opposite. The result below, -0.815757, indicates strong negative correlation between reduced_lunch and school_rating . There's clearly a relationship between the two variables.
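A sketch of this step, under the same assumptions about the file and variable names as above, with the output shown in the table below:

```python
import pandas as pd

df = pd.read_csv('middle_tn_schools.csv')  # hypothetical filename

# pairwise correlation between the two variables (values range from -1 to 1)
print(df[['reduced_lunch', 'school_rating']].corr())
```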

                reduced_lunch  school_rating
reduced_lunch        1.000000      -0.815757
school_rating       -0.815757       1.000000

Sally continues to explore this relationship graphically.

Essential Graphs for Exploring Data

Box-and-whisker plot.

In her stats book, Sally sees a box-and-whisker plot. A box-and-whisker plot is helpful for visualizing how the data are distributed around their center. Understanding the distribution lets Sally see how spread out her data are; the larger the spread, the less robust reduced_lunch is at explaining school_rating.

See below for an explanation of the box-and-whisker plot.

[Figure: explanation of how to read a box-and-whisker plot]

Now that Sally knows how to read the box-and-whisker plot, she graphs reduced_lunch to see the distributions. See below.
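A minimal sketch of how such a plot could be produced with pandas and Matplotlib (file and column names as assumed above; the original article may have used a different plotting approach):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('middle_tn_schools.csv')  # hypothetical filename

# one box per school_rating, showing the spread of reduced_lunch
df.boxplot(column='reduced_lunch', by='school_rating')
plt.xlabel('school_rating')
plt.ylabel('reduced_lunch (%)')
plt.show()
```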

[Figure: box-and-whisker plots of reduced_lunch by school_rating]

In her box-and-whisker plots, Sally sees that the minimum and maximum reduced_lunch values tend to get closer to the mean as school_rating decreases; that is, as school_rating decreases so does the standard deviation in reduced_lunch .

What does this mean?

Starting with the top box-and-whisker plot, as school_rating decreases, reduced_lunch becomes a more powerful way to explain outcomes. This could be because as parents' incomes decrease they have fewer resources to devote to their children's education (such as after-school programs, tutors, time spent on homework, computer camps, etc.) than higher-income parents. Above a 3-star rating, more predictors are needed to explain school_rating due to an increasing spread in reduced_lunch.

Having used box-and-whisker plots to reaffirm her idea that household income and school performance are related, Sally seeks further validation.

Scatter Plot

To further examine the relationship between school_rating and reduced_lunch , Sally graphs the two variables on a scatter plot. See below.
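One way to produce such a scatter plot with a fitted trend line, using NumPy's polyfit for the line (again, not necessarily the article's exact code; file and column names as assumed above):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('middle_tn_schools.csv')  # hypothetical filename

plt.scatter(df['reduced_lunch'], df['school_rating'], alpha=0.5)

# fit and draw a straight (degree-1) trend line
slope, intercept = np.polyfit(df['reduced_lunch'], df['school_rating'], 1)
x = np.linspace(df['reduced_lunch'].min(), df['reduced_lunch'].max(), 100)
plt.plot(x, slope * x + intercept)

plt.xlabel('reduced_lunch (%)')
plt.ylabel('school_rating')
plt.show()
```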

[Figure: scatter plot of school_rating versus reduced_lunch with a downward trend line]

In the scatter plot above, each dot represents a school. The placement of the dot represents that school's rating (y-axis) and the percentage of its students on reduced lunch (x-axis).

The downward trend line shows the negative correlation between school_rating and reduced_lunch (as one increases, the other decreases). The slope of the trend line indicates how much school_rating decreases as reduced_lunch increases. A steeper slope would indicate that a small change in reduced_lunch has a big impact on school_rating while a more horizontal slope would indicate that the same small change in reduced_lunch has a smaller impact on school_rating .

Sally notices that the scatter plot further supports what she saw with the box-and-whisker plot: when reduced_lunch increases, school_rating decreases. The tighter spread of the data as school_rating declines indicates the increasing influence of reduced_lunch . Now she has a hypothesis.

Correlation Matrix

Sally is ready to test her hypothesis: a negative relationship exists between school_rating and reduced_lunch (to be covered in a follow up article). If the test is successful, she'll need to build a more robust model using additional variables. If the test fails, she'll need to re-visit her dataset to choose other variables that possibly explain school_rating . Either way, Sally could benefit from an efficient way of assessing relationships among her variables.

An efficient graph for assessing relationships is the correlation matrix, as seen below; its color-coded cells make it easier to interpret than the tabular correlation matrix above. Red cells indicate positive correlation; blue cells indicate negative correlation; white cells indicate no correlation. The darker the colors, the stronger the correlation (positive or negative) between those two variables.
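A minimal sketch of such a heatmap using seaborn (the article does not show its plotting code here, so this approach and the file name are assumptions):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('middle_tn_schools.csv')  # hypothetical filename

# color-coded correlation matrix: red = positive, blue = negative
corr = df.select_dtypes('number').corr()
sns.heatmap(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()
```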

[Figure: color-coded correlation matrix (heatmap) of the dataset's variables]

With the correlation matrix in mind as a future starting point for finding additional variables, Sally moves on for now and prepares to test her hypothesis.
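The formal test is deferred to the follow-up article; purely as a preview, and not necessarily the approach the authors take, one common way to test whether the observed correlation differs from zero is a Pearson correlation test with SciPy:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv('middle_tn_schools.csv')  # hypothetical filename

# Pearson correlation coefficient and the p-value for the null hypothesis of zero correlation
r, p = stats.pearsonr(df['reduced_lunch'], df['school_rating'])
print(f"r = {r:.3f}, p = {p:.3g}")
```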

Sally was approached with a problem: why are some schools in middle Tennessee under-performing? To answer this question, she did the following:

  • Conducted a lit review to educate herself on the topic.
  • Gathered data from a reputable source to explore school ratings and the characteristics of student bodies and schools in middle Tennessee.
  • Found that the data indicated a robust relationship between school_rating and reduced_lunch.
  • Explored the data visually.
  • Kept her mind open to other explanations, though satisfied with her preliminary findings.
  • Developed a hypothesis: a negative relationship exists between school_rating and reduced_lunch.

In a follow up article, Sally will test her hypothesis. Should she find a satisfactory explanation for her sample of schools, she will attempt to apply her explanation to the population of schools in Tennessee.

Course Recommendations

Further learning: Applied Data Science with Python (Coursera); Statistics and Data Science MicroMasters (edX).


Meet the Authors

Tim Dobbins LearnDataSci Author

A graduate of Belmont University, Tim is a Nashville, TN-based software engineer and statistician at Perception Health, an industry leader in healthcare analytics, and co-founder of Sidekick, LLC, a data consulting company. Find him on  Twitter  and  GitHub .

John Burke Data Scientist Author @ Learn Data Sci

John is a research analyst at Laffer Associates, a macroeconomic consulting firm based in Nashville, TN. He graduated from Belmont University. Find him on GitHub and LinkedIn.


Analyze Traffic Safety Data with Python Case Study

Create data visualizations of traffic data from the last two decades and model the relationship between smartphones and collision rates.


About this course

Ready to start your journey into analyzing real data with Python? Take Analyze Traffic Safety Data with Python Case Study — you will explore two real datasets, plot trends over time, and model and predict traffic safety outcomes.

Skills you'll gain

Practice making line plots, scatter plots, and box plots

Analyze a correlation visually and statistically

Predict outcomes with a linear regression model

Traffic Safety Case Study

Visualize traffic safety trends and predict crash rates from smartphone usage
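The course page does not include its code or data; the sketch below is only an illustration of the kind of model it describes, using made-up synthetic numbers and hypothetical column names (nothing here reproduces the course's real datasets or results):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# purely synthetic, illustrative numbers; column names are hypothetical
rng = np.random.default_rng(0)
usage = np.linspace(20, 85, 20)                      # hypothetical smartphone usage (%)
crashes = 300 + 1.5 * usage + rng.normal(0, 10, 20)  # hypothetical crash rate per 100k

data = pd.DataFrame({'smartphone_usage_pct': usage, 'crashes_per_100k': crashes})

model = LinearRegression()
model.fit(data[['smartphone_usage_pct']], data['crashes_per_100k'])

# predict the crash rate at a hypothetical 90% usage level
print(model.predict(pd.DataFrame({'smartphone_usage_pct': [90]})))
```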



Here are 32 public repositories matching this topic...

Prateekiiest / code-sleep-python.

Awesome Projects in Python - Machine Learning Applications, Games, Desktop Applications all in Python 🐍

  • Updated Sep 17, 2023

TatevKaren / artificial-neural-network-business_case_study

Business Case Study to predict customer churn rate based on Artificial Neural Network (ANN), with TensorFlow and Keras in Python. This is a customer churn analysis that contains training, testing, and evaluation of an ANN model. (Includes: Case Study Paper, Code)

  • Updated May 4, 2021

TatevKaren / convolutional-neural-network-image_recognition_case_study

Computer Vision Case Study in image recognition to classify an image to a binary class, based on Convolutional Neural Networks (CNN), with TensorFlow and Keras in Python, to identify from an image whether it is an image of a dog or cat. (Includes: Data, Case Study Paper, Code)

  • Updated Apr 18, 2022

Sofosss / DockerTutorialGR

DockerTutorialGR: A beginner's tutorial on Docker in Greek. It provides concise guidance on Docker fundamentals, including the creation of Dockerfiles, building images, running containers and sharing data between them.

  • Updated Apr 16, 2024

JanTeichertKluge / DMLSim

This library provides packages on DoubleML / Causal Machine Learning and Neural Networks in Python for Simulation and Case Studies.

  • Updated Jun 20, 2023

BamdadTabari / illegible_python_collection

lots of pretty and simple python tools, for everyone who loves python.

  • Updated Aug 26, 2022

carlocontaldi / trivago_case_study

Data science case study: Click Rate Prediction using real-world Trivago data.

  • Updated Jan 7, 2019

abhisheksaxena1998 / How-world-reacted-to-coronavirus-casestudy

This case study investigates the world's reaction on Twitter during the COVID-19 pandemic. Tweets collected over a period of 7 months were classified as hateful or offensive, along with the emotions involved.

  • Updated Oct 8, 2020

lcvriend / toponym_extraction

| thesis project | Toponym extraction from LexisNexis data using named entity recognition

  • Updated Feb 23, 2020

RickBarretto / toCase

toCase is a case converter written in Python for anyone who wants to simplify this task. It can convert to and from Camel, Pascal, Snake, Kebab, and sentence case.

  • Updated Jan 11, 2022

terodea / CS-BigData

Learn Big Data tools and frameworks by working through examples, POCs, and projects.

  • Updated Jul 29, 2022

meetpatel0963 / KMeans-Spark

KMeans Clustering using Spark on Uber's ride share data - Case Study (Big Data Analytics @uber )

  • Updated Jul 16, 2024

carlocontaldi / hellofresh_case_study

Data science case study: Churn Prediction using real-world HelloFresh data.

gabrieldluca / bird-migration

Tracking bird migrations through GPS data provided by the LifeWatch INBO project.

  • Updated Aug 16, 2017

Prajna-Ramamurthy / Sanskrit-To-English-Karaka-Translator

The project for the summer course Understanding Natural Languages: Samskrita as a Case Study

  • Updated Sep 1, 2023

Vitorrrocha / ComparingRnaSequences

Case study: comparing ribosomal RNA sequences between humans and bacteria.

  • Updated May 13, 2020

Amir-SaberHabibi / NTFA

a Graph-based Integration of Network Traffic Flow Analysis: a Case Study.

  • Updated Jul 21, 2024

rakifsul / studi_kasus_python_3_password_generator

Python 3 case study: a password generator application.

  • Updated Jun 30, 2024

pleiadev24 / spotify-stats

Data Analysis Case study of Spotify Dataset (Mathplotlib)

  • Updated Nov 9, 2023

ChoiHongYong / Hong

  • Updated Aug 21, 2018



Data Analytics with Python: Use Case Demo


Data is being generated at a massive rate, minute by minute. Organizations, in turn, are trying to explore every opportunity to make sense of this data. This is where data analytics becomes crucial to running a business successfully. It is commonly used in companies to drive profit and business growth. In this article, we’ll learn data analytics using Python.

What is Data Analytics?

Data analytics is the process of exploring and analyzing large datasets to make predictions and boost data-driven decision making. Data analytics allows us to collect, clean, and transform data to derive meaningful insights. It helps to answer questions, test hypotheses, or disprove theories. 

Let’s understand the various applications of data analytics.


Applications of Data Analytics

Data analytics is used in most sectors of businesses. Here are some primary areas where data analytics does its magic:


  • Data analytics is used in the banking and e-commerce industries to detect fraudulent transactions.
  • The healthcare sector uses data analytics to improve patient health by detecting diseases before they happen. It is commonly used for cancer detection.
  • Data analytics finds its usage in inventory management to keep track of different items.
  • Logistics companies use data analytics to ensure faster delivery of products by optimizing vehicle routes.
  • Marketing professionals use analytics to reach out to the right customers and perform targeted marketing to increase ROI.
  • Data analytics can be used for city planning, to build smart cities.

Types of Data Analytics

Data analytics can be broadly classified into 3 types:

1. Descriptive Analytics

It tells you what has happened. It can be done using exploratory data analysis.

Example: Studying the total units of chairs sold and the profit that was made in the past.

2. Predictive Analytics

It tells you what will happen. It can be achieved by building predictive models.

Example: Predicting the total units of chairs that would sell and the profit we can expect in the future.

3. Prescriptive Analytics

It tells you how to make something happen. It can be done by deriving key insights and hidden patterns from the data.

Example: Finding ways to improve sales and profit of chairs.

The graph below represents the difficulty level of, and the value that can be derived from, the different types of data analytics.

[Figure: difficulty versus value for descriptive, predictive, and prescriptive analytics]

Data Analytics Process Steps

There are primarily five steps involved in the data analytics process, which include:

  • Data Collection : The first step in data analytics is to collect or gather relevant data from multiple sources. Data can come from different databases, web servers, log files, social media, Excel and CSV files, etc.
  • Data Preparation : The next step in the process is to prepare the data. It involves cleaning the data to remove unwanted and redundant values, converting it into the right format, and making it ready for analysis. It also requires data wrangling.
  • Data Exploration : After the data is ready, data exploration is done using various data visualization techniques to find unseen trends in the data.
  • Data Modeling : The next step is to build predictive models using machine learning algorithms to make future predictions.
  • Result Interpretation : The final step in any data analytics process is to derive meaningful results and check whether the output is in line with your expected results.

Why Data Analytics Using Python?

There are many programming languages available, but Python is popularly used by statisticians, engineers, and scientists to perform data analytics.

Here are some of the reasons why Data Analytics using Python has become popular:

  • Python is easy to learn and understand and has a simple syntax.
  • The programming language is scalable and flexible.
  • It has a vast collection of libraries for numerical computation and data manipulation.
  • Python provides libraries for graphics and data visualization to build plots.
  • It has broad community support to help solve many kinds of queries.

Python Libraries for Data Analytics

One of the main reasons why Data Analytics using Python has become the most preferred and popular mode of data analysis is that it provides a range of libraries.

NumPy : NumPy supports n-dimensional arrays and provides numerical computing tools. It is useful for linear algebra and Fourier transforms.

Pandas : Pandas provides functions to handle missing data, perform mathematical operations, and manipulate the data.

Matplotlib : Matplotlib library is commonly used for plotting data points and creating interactive visualizations of the data.

SciPy : SciPy library is used for scientific computing. It contains modules for optimization, linear algebra, integration, interpolation, special functions, signal and image processing.

Scikit-Learn : Scikit-Learn library has features that allow you to build regression, classification, and clustering models.
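As a quick illustration of that workflow (not part of the article's own demo), a minimal classification example on scikit-learn's built-in iris dataset might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# small built-in dataset, used only to illustrate the scikit-learn workflow
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out data
```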

Now, let’s look at how to perform data analytics using Python and its libraries.

Data Analytics Using the Python Library, NumPy

Let’s see how you can perform numerical analysis and data manipulation using the NumPy library.

1. Create a NumPy array.

2. Access and manipulate elements in the array.

3. Create a 2-dimensional array and check the shape of the array.

4. Access elements from the 2D array using index positions.

5. Create an array of type string.

6. Use the arange() and linspace() functions to create evenly spaced values in a specified interval.

7. Create an array of random values between 0 and 1 in a given shape.

8. Create an array of constant values in a given shape.

9. Repeat each element of an array a specified number of times using the repeat() and tile() functions.

10. Create an identity matrix using the eye() and identity() functions.


11. Create a 5x5 2D array of random numbers between 0 and 1.

12. Sum an array along each column.

13. Sum an array along each row.

14. Calculate the mean, median, standard deviation, and variance.

15. Sort an array along the row using the sort() function.

16. Append elements to an array using the append() function.

17. Delete multiple elements from an array.

18. Concatenate elements from two arrays.

(A consolidated code sketch of these operations appears below.)
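The article presents each of the steps above as screenshots, which did not survive extraction. Below is a compact sketch of the same operations with arbitrary example values; the exact arrays in the original may differ:

```python
import numpy as np

a = np.array([1, 2, 3, 4])                       # 1. create an array
a[0] = 10                                        # 2. access and modify elements
m = np.array([[1, 2, 3], [4, 5, 6]])             # 3. 2-D array; m.shape -> (2, 3)
print(m[1, 2])                                   # 4. index a 2-D array
s = np.array(['a', 'b', 'c'])                    # 5. array of strings
r = np.arange(0, 10, 2)                          # 6. evenly spaced values (by step)
l = np.linspace(0, 1, 5)                         # 6. evenly spaced values (by count)
u = np.random.rand(3)                            # 7. random values in [0, 1)
c = np.full((2, 2), 7)                           # 8. constant values in a given shape
rep = np.repeat([1, 2], 3)                       # 9. repeat each element
til = np.tile([1, 2], 3)                         # 9. tile the whole array
i = np.eye(3)                                    # 10. identity matrix (also np.identity(3))
g = np.random.rand(5, 5)                         # 11. 5x5 random matrix
print(g.sum(axis=0))                             # 12. sum along each column
print(g.sum(axis=1))                             # 13. sum along each row
print(g.mean(), np.median(g), g.std(), g.var())  # 14. mean, median, std, variance
print(np.sort(m, axis=1))                        # 15. sort along each row
print(np.append(a, [5, 6]))                      # 16. append elements
print(np.delete(a, [0, 1]))                      # 17. delete elements by index
print(np.concatenate([a, r]))                    # 18. concatenate two arrays
```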


Data Analytics Using Python Libraries, Pandas and Matplotlib

We’ll use a car.csv dataset and perform exploratory data analysis using Pandas and Matplotlib library functions to manipulate and visualize the data and find insights.

1. Import the libraries.

2. Load the dataset using the pandas read_csv() function.

3. Display the head of the dataset using the head() function.

4. Display the bottom 5 rows of the dataset using the tail() function.

5. Print summary statistics of the dataset using the describe() function.

6. Plot a histogram for all the variables.

7. Draw a box plot to visualize the relationship between vehicle size and engine hp.

8. Build a pair plot using the seaborn library.

9. Drop irrelevant columns from the dataset using the drop() function.

10. Use the rename() function to rename the columns.

11. Print the total number of duplicate rows.

12. Remove the duplicate rows using the drop_duplicates() function.

13. Drop the missing values from the dataset.

14. Plot a histogram to find the number of cars per brand.

15. Draw a correlation plot between the variables.

(A consolidated code sketch of these steps appears below.)
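As with the NumPy section, the original screenshots are missing. The sketch below follows the fifteen steps above; car.csv is the file the article names, but the column names used here (vehicle_size, engine_hp, brand, unused_column) are placeholders, since the article does not list them:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns                             # 1. import the libraries

df = pd.read_csv('car.csv')                       # 2. load the dataset
print(df.head())                                  # 3. first five rows
print(df.tail())                                  # 4. last five rows
print(df.describe())                              # 5. summary statistics

df.hist(figsize=(10, 8))                          # 6. histogram for each numeric column
plt.show()

# 7. box plot of engine power by vehicle size (column names are placeholders)
sns.boxplot(x='vehicle_size', y='engine_hp', data=df)
plt.show()

sns.pairplot(df)                                  # 8. pair plot of numeric columns
plt.show()

df = df.drop(columns=['unused_column'])           # 9. drop an irrelevant column (placeholder name)
df = df.rename(columns={'engine_hp': 'hp'})       # 10. rename columns
print(df.duplicated().sum())                      # 11. number of duplicate rows
df = df.drop_duplicates()                         # 12. remove duplicates
df = df.dropna()                                  # 13. drop missing values

df['brand'].value_counts().plot(kind='bar')       # 14. cars per brand (placeholder column)
plt.show()

# 15. correlation plot between numeric variables
sns.heatmap(df.select_dtypes('number').corr(), cmap='coolwarm')
plt.show()
```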

Data is being generated rapidly and in various formats, and companies rely on data analytics to derive valuable information and hidden insights from it. After reading this ‘Data analytics using Python’ article, you will have learned what data analytics is and its various applications. You also looked at the different types of data analytics and the steps in the data analytics process. Finally, you performed data analytics using Python’s NumPy, Pandas, and Matplotlib libraries.

Do you have any questions for us on this ‘Data analytics using Python’ article? If so, please put them in the comments section of this article. Our team of experts will help resolve your queries at the earliest.


About the Author

Avijeet Biswal

Avijeet is a Senior Research Analyst at Simplilearn. Passionate about Data Analytics, Machine Learning, and Deep Learning, Avijeet is also interested in politics, cricket, and football.





Learn Data Analysis with Python

Lessons in Coding

  • © 2018
  • Authors: A.J. Henley, Dave Wolf
  • Affiliations: Washington, D.C., USA; Sterling Business Advantage, LLC, Adamstown, USA

  • A quick and practical hands-on guide to learning and using Python in data analysis
  • Includes three exercises and one analysis project case study
  • Learn to visualize data using Python



About this book

  • Get data into and out of Python code
  • Prepare the data and its format
  • Find the meaning of the data
  • Visualize the data using iPython


Keywords: machine learning, data science, application

Table of contents (6 chapters)

  • Front Matter
  • How to Use This Book
  • Getting Data Into and Out of Python
  • Preparing Data Is Half the Battle
  • Finding the Meaning
  • Visualizing Data
  • Practice Problems
  • Back Matter

Authors and Affiliations

A.J. Henley, Dave Wolf

Bibliographic information

Book Title : Learn Data Analysis with Python

Book Subtitle : Lessons in Coding

Authors : A.J. Henley, Dave Wolf

DOI : https://doi.org/10.1007/978-1-4842-3486-0

Publisher : Apress Berkeley, CA

eBook Packages : Professional and Applied Computing , Apress Access Books , Professional and Applied Computing (R0)

Copyright Information : A.J. Henley and Dave Wolf 2018

Softcover ISBN : 978-1-4842-3485-3 Published: 23 February 2018

eBook ISBN : 978-1-4842-3486-0 Published: 22 February 2018

Edition Number : 1

Number of Pages : IX, 97

Number of Illustrations : 15 illustrations in colour

Topics : Python , Data Mining and Knowledge Discovery , Big Data , Big Data/Analytics



COMMENTS

  1. 10 Real World Data Science Case Studies Projects with Example

    A case study in data science is an in-depth analysis of a real-world problem using data-driven approaches. It involves collecting, cleaning, and analyzing data to extract insights and solve challenges, offering practical insights into how data science techniques can address complex issues across various industries.

  2. Case-Studies-Python

    This repository is a companion to the textbook Case Studies in Neural Data Analysis, by Mark Kramer and Uri Eden. That textbook uses MATLAB to analyze examples of neuronal data. The material here is similar, except that we use Python. The intended audience is the practicing neuroscientist - e.g., the students, researchers, and clinicians ...

  3. Data Science Case Studies: Solved and Explained

    All of the data science case studies mentioned below are solved and explained using Python. Case Study 1: ... Use Case: Customer Personality Analysis is a detailed analysis of a company's ideal ...

  4. Using Python for Data Analysis

    Data analysis is a broad term that covers a wide range of techniques that enable you to reveal any insights and relationships that may exist within raw data. As you might expect, Python lends itself readily to data analysis. Once Python has analyzed your data, you can then use your findings to make good business decisions, improve procedures, and even make informed predictions based on what ...

  5. Python Case Studies

    A Case study in python. Creating an ML model to predict the apt price of a given diamond. Predicting the right price for an old car using python machine learning. Create an ML model to forecast the demand of rental bikes every hour of the day. Estimating the price of a computer, based on its specs.

  6. Introduction

    This project is a collection of six captivating case studies that use Python and computational techniques to analyse data, build classification models and unravel insights on multifaceted datasets. The array of topics touch on different domains of knowledge. Central to these studies is the application of tools and concepts.

  7. Data Science Case Studies: Solved using Python

    All of the data science case studies mentioned below are solved and explained using Python. Case Study 1: ... Use Case: Customer Personality Analysis is a detailed analysis of a company's ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs ...

  8. Python for the practicing neuroscientist

Most of the functions used here are the same in Python 2 and 3. One notable exception, however, is division. If you are using Python 2, you will find that the division operator / actually computes the floor of the division if both operands are integers (i.e., no decimal points). For example, in Python 2, 4/3 equals 1. While, in Python 3, 4/3 ...

  9. GitHub

    Data Science Projects with Python is designed to give you practical guidance on industry-standard data analysis and machine learning tools in Python, with the help of realistic data. The course will help you understand how you can use pandas and Matplotlib to critically examine a dataset with summary statistics and graphs, and extract the insights you seek to derive.

  10. Logistic Regression Case Study: Statistical Analysis in Python

The above code will load the dataset into 'data'. The 'Attrition' column is our dependent variable and the others are independent. One thing to note here is that 'Attrition' takes the value ...

  11. Essential Statistics for Data Science: A Case Study using Python, Part

Authors: Tim Dobbins, Engineer & Statistician, and John Burke, Research Analyst. Essential Statistics for Data Science: A Case Study using Python, Part I. Get to know some of the essential statistics you should be very familiar with when learning data science. LearnDataSci is reader-supported.

  12. Analyze Traffic Safety Data with Python Case Study

    Analyze Traffic Safety Data with Python. Visualize traffic safety data and analyze the relationship between collisions and smartphone usage over time. Meet the creator of the course. Meet the full team. Andrea Hassler. Andrea has a Master's in Applied Statistics from NYU and a Bachelor's in Psychology from SUNY New Paltz.

  13. case-study · GitHub Topics · GitHub

    Business Case Study to predict customer churn rate based on Artificial Neural Network (ANN), with TensorFlow and Keras in Python. This is a customer churn analysis that contains training, testing, and evaluation of an ANN model. (Includes: Case Study Paper, Code)

  14. A Data Science Case Study with Python: Mercari Price Prediction

    A Data Science Case Study with Python: Mercari Price Prediction. ... In this case study, we will walk through the Analysis, Modelling and Communication part of the workflow. The general steps involved for solving a data science problem are as follows: ... did exploratory data analysis, feature transformations and finally selected ML models, did ...

  15. 5 Solved end-to-end Data Science Projects in Python

    1. Sentiment Analysis. The first project of this list is to build a machine learning model that predicts the sentiment of a movie review. Sentiment analysis is an NLP technique used to determine whether data is positive, negative, or neutral.

  16. Data Science Projects with Python: A case study approach to gaining

    This creates a case-study approach that simulates the working conditions you'll experience in real-world data science projects. You'll learn how to use key Python packages, including pandas, Matplotlib, and scikit-learn, and master the process of data exploration and data processing, before moving on to fitting, evaluating, and tuning ...

  17. Inferential Statistical Analysis with Python

    This course is part of the Statistics with Python Specialization. When you enroll in this course, you'll also be enrolled in this Specialization. Learn new concepts from industry experts. Gain a foundational understanding of a subject or tool. Develop job-relevant skills with hands-on projects.

  18. Data Analytics With Python: Use Case Demo

    Data Analytics Using Python Libraries, Pandas and Matplotlib. We'll use a car.csv dataset and perform exploratory data analysis using Pandas and Matplotlib library functions to manipulate and visualize the data and find insights. 1. Import the libraries. 2. Load the dataset using pandas read_csv () function.

  19. Case Study (SQL, Python): Data analysis of the Olympic Games

    This case study highlights how using SQL (and Python) for data analysis allows for discovering trends and insights of particular statistical interest. The case study is based on historical data ...

  20. Learn Data Analysis with Python: A Case Study

    To prepare data for analysis, here's what we do: #import libraries. import pandas as pd. import seaborn as sns. In Python, it's usual to add "as something" when you import the library. This makes the code less lengthy when you call the libraries. Next, we need data.

  21. Analyzing E-Commerce Data with Python (Case Study)

    This code is a common initial step in data analysis using Python, where these modules are often used for data manipulation, analysis, and visualization. ... Syntax Code Explanation from Case Study ...

  22. Learn Data Analysis with Python: Lessons in Coding

Get started using Python in data analysis with this compact practical guide. This book includes three exercises and a case study on getting data in and out of Python code in the right format. Learn Data Analysis with Python also helps you discover meaning in the data using analysis and shows you how to visualize it.