Solving Real-World Data Analysis Questions with Python! (Internet Usage Analysis)
Summary
This live stream walks through a real-world Python data analysis of internet usage data from the World Bank. The primary goal is to teach practical Python analysis skills while raising money for giveinternet.org, a non-profit dedicated to providing internet access and laptops to underserved communities globally. The session covers collecting data from the World Bank DataBank and cleaning it with pandas: renaming columns, filtering out irrelevant historical data (pre-1990), and handling missing values. It also demonstrates key data manipulation techniques such as transposing and converting data types. The analysis then moves to data visualization, using Plotly Express to plot internet usage trends over time for selected countries such as Germany, Greenland, and Turkey, and explores methods for identifying the countries with the highest recent growth in internet access.
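The summary does not say how "highest recent growth" was measured in the session. As a minimal sketch of one possible definition, the example below ranks countries by the change in percentage points between two years. The table, its values, and the chosen window (2015 to 2019) are made up for illustration and are not World Bank figures.

```python
import pandas as pd

# Hypothetical year-by-country table of internet usage (% of population).
# All values are invented for illustration only.
usage = pd.DataFrame(
    {
        "Germany": [84.0, 86.0, 88.0],
        "Greenland": [66.0, 68.0, 70.0],
        "Turkey": [54.0, 65.0, 74.0],
    },
    index=pd.Index([2015, 2017, 2019], name="year"),
)

# Growth in percentage points between the first and last year of the window.
growth = usage.loc[2019] - usage.loc[2015]

# Rank countries by that growth, largest first.
top = growth.sort_values(ascending=False)
print(top)
```

Other definitions, such as relative (percent) change, would rank countries differently, so the choice should match the question being asked.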
Key takeaway
Data analysts and data scientists working with time-series or global economic data should prioritize robust data cleaning and transformation before visualization. Specifically, use tools like `pyjanitor` for consistent column naming and `pandas.DataFrame.interpolate` to fill missing values, so that gaps in the data do not obscure the trends your visualizations present to stakeholders.
Key insights
Real-world data analysis pairs Python skills with careful data cleaning, transformation, and visualization.
Principles
- Standardize column names for easier data manipulation.
- Filter irrelevant historical data to improve analysis focus.
- Interpolate missing time-series data to reveal trends.
Method
Collect World Bank data, clean column names and types, filter by relevant years and series, transpose for time-series analysis, and visualize trends using interactive plotting libraries like Plotly.
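The method above can be sketched with pandas alone. This is an illustrative outline, not the code from the session: it assumes the export has one row per country, year columns labeled like `1990 [YR1990]`, and `..` as the missing-value placeholder. The rows, the values, and the `clean` helper are hypothetical.

```python
import pandas as pd

# Hypothetical rows shaped like a World Bank DataBank export (values are made up).
raw = pd.DataFrame({
    "Country Name": ["Germany", "Turkey"],
    "Series Name": ["Individuals using the Internet (% of population)"] * 2,
    "1985 [YR1985]": ["..", ".."],
    "1990 [YR1990]": ["0.1", ".."],
    "2000 [YR2000]": ["30.2", "3.8"],
    "2010 [YR2010]": ["82.0", "39.8"],
})

def clean(col: str) -> str:
    """Hypothetical helper: '1990 [YR1990]' -> '1990', 'Country Name' -> 'country_name'."""
    col = col.split(" [")[0]
    return col.strip().lower().replace(" ", "_")

df = raw.rename(columns=clean)

# Keep only the year columns from 1990 onward.
year_cols = [c for c in df.columns if c.isdigit() and int(c) >= 1990]

# Transpose so each row is a year and each column is a country.
wide = df.set_index("country_name")[year_cols].T
wide.index = wide.index.astype(int)
wide.index.name = "year"

# Convert strings to numbers; the ".." placeholders become NaN.
wide = wide.apply(pd.to_numeric, errors="coerce")
print(wide)
```

A year-indexed table with one column per country is a convenient shape for line charts, since each column maps to one trace.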
In practice
- Use `pyjanitor` for standardized column cleaning.
- Employ `pd.to_numeric` with `errors='coerce'` for robust type conversion.
- Apply `df.interpolate(method='linear')` for time-series gap filling.
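The three practices above can be tried on a small example. The sketch below standardizes column names with plain pandas string methods rather than `pyjanitor` (whose `clean_names` method serves a similar purpose) so that it runs without extra dependencies. The column name and values are invented for illustration.

```python
import pandas as pd

# Hypothetical yearly series with a ".." placeholder and a missing entry (values are made up).
df = pd.DataFrame(
    {"Internet Users (%)": ["10.0", "..", "30.0", None, "50.0"]},
    index=[2015, 2016, 2017, 2018, 2019],
)

# Standardize column names to snake_case using pandas string methods.
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(r"[^0-9a-z]+", "_", regex=True)
    .str.strip("_")
)

# Coerce anything non-numeric (".." or None) to NaN instead of raising an error.
df["internet_users"] = pd.to_numeric(df["internet_users"], errors="coerce")

# Fill interior gaps with a straight line between the neighboring values.
filled = df.interpolate(method="linear")
print(filled)
```

Note that `method="linear"` treats rows as equally spaced and ignores the index values, which is reasonable here because the years are consecutive.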
Topics
- Python Data Analysis
- Pandas Library
- Data Cleaning
- Internet Usage Trends
- Data Visualization
Best for: Data Scientist, Data Analyst, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Keith Galli.