Analyzing Open-Source Package Sustainability: Part 2 – Efficient Data Fetching



Data and feature engineering play a vital role in the machine learning lifecycle.

This is the second blog in a four-part series; if you haven't read the first blog, we urge you to check it out here. The first blog gave an overall introduction to PSS, and in this one we walk through the feature and data engineering process.

So far, we have introduced the Package Sustainability Scanner (PSS) and the challenges of evaluating the health of open-source packages. Many developers rely on labels from package managers, but these can often be misleading. Imagine depending on a package that appears stable but is actually on the verge of being abandoned: that's every developer's worst nightmare!

So, when browsing an open-source package’s repository, what key factors does a developer check? Typically, they’ll look at stars, forks, last commit date, number of watchers, total releases, and open/closed issues. But can these basic metrics really tell us how well-maintained a package is?
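
For a single repository, these surface metrics are one API call away. Below is a minimal sketch using the GitHub REST API; the repo slug is just an example, and the token is optional (it only raises rate limits):

```python
# Minimal sketch: fetch the basic health metrics for one repository
# via the GitHub REST API. The repo slug below is just an example.
import os
import requests

def fetch_basic_metrics(owner: str, repo: str) -> dict:
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")  # optional; raises rate limits
    if token:
        headers["Authorization"] = f"Bearer {token}"

    resp = requests.get(f"https://api.github.com/repos/{owner}/{repo}",
                        headers=headers)
    resp.raise_for_status()
    data = resp.json()

    return {
        "stars": data["stargazers_count"],
        "forks": data["forks_count"],
        "watchers": data["subscribers_count"],     # true watcher count
        "open_issues": data["open_issues_count"],  # note: includes open PRs
        "last_push": data["pushed_at"],
        "archived": data["archived"],
    }

print(fetch_basic_metrics("pypa", "pip"))
```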

The Developer’s Dilemma

While these basic metrics offer a quick overview, they barely scratch the surface. They might hint at whether a package is growing or fading, but they don’t provide a full picture of its long-term maintenance and community support.

And let’s be real, no developer has time to manually analyze this data across hundreds of dependencies and repositories. That’s where our tool, PSS, steps in. Using a machine learning approach, PSS digs deeper than what meets the eye, helping developers make informed decisions about their dependencies.

In this blog, we’ll walk through the process of analyzing package sustainability step by step. We’ll start by selecting relevant features, then gather data from the package index and repositories, clean it up, and finally perform an initial analysis to spot trends like skewness and correlations. This structured approach helps us move beyond surface-level insights and build a more accurate picture of a package’s long-term viability.

Digging Deeper: Advanced Features That Matter

To really understand a package’s health, it’s crucial to go beyond the basics. Here are some of the advanced parameters we added:

  • Last Commit Across All Branches:

    • Instead of just the main branch, we check commits across all branches. This gives us insight into ongoing activity among maintainers and contributors with direct branch access.

  • Last Resolved Time:

    • This metric captures when the last feature or patch was merged; essentially, it is the bridge between commit and release (commit -> merge request -> release). It also gives us a better sense of real, ongoing maintenance.

  • Average Time to Resolve Issues and PRs:

    • For mature, popular packages experiencing a decline in maintenance, these are the first parameters to show signs of a slowdown. Most popular, stable packages follow quarterly or bi-quarterly release patterns, so a slowdown takes time to reflect in other parameters, but issues and PRs piling up can signal it early (see the sketch after this list).

  • Merged PR Count:

    • We refined our basic parameters by counting only merged pull requests rather than all closed ones, and by noting the archived status of repos. An archived repo is a clear indicator of inactivity.

We also captured some extra details, like development status, archive status, and whether a description was present.
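
To make the "average time to resolve" parameter concrete, here is a hedged sketch against the GitHub REST API. It looks at a single page of recently closed items and omits pagination and authentication for brevity; the 100-item window is an illustrative choice, not our production setting.

```python
# Sketch: average time-to-resolve over one page of recently closed
# issues and PRs, via the public GitHub REST API (auth/pagination omitted).
from datetime import datetime

import requests

def avg_resolution_days(owner: str, repo: str) -> dict:
    url = f"https://api.github.com/repos/{owner}/{repo}/issues"
    items = requests.get(url, params={"state": "closed", "per_page": 100}).json()

    durations = {"issues": [], "prs": []}
    for item in items:
        opened = datetime.fromisoformat(item["created_at"].replace("Z", "+00:00"))
        closed = datetime.fromisoformat(item["closed_at"].replace("Z", "+00:00"))
        # This endpoint returns PRs too; they carry a "pull_request" key.
        bucket = "prs" if "pull_request" in item else "issues"
        durations[bucket].append((closed - opened).days)

    return {k: sum(v) / len(v) if v else None for k, v in durations.items()}
```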

Why Feature Engineering Came First

You may wonder: why engineer features before even fetching the data? In a perfect world, we’d follow the ML lifecycle step by step. In reality, however, you have to work out which features hold true semantic value before building your architecture for data collection, especially when dealing with large amounts of data. This early focus on what really matters is an investment that saves costly reiterations down the line. Imagine refetching all the data for hundreds of thousands of packages just because a feature was added.

Tackling the Data Deluge: Setting the Right Filter

PyPI has around 500,000 packages, with about 390,000 of them linking to a repository. After some initial exploratory data analysis (EDA), a few things stood out:

  • Number of releases: The distribution was extremely positively skewed (skewness > 30, with Q1 at just 2 releases). This raised a red flag: are packages with only 1 or 2 releases even worth the effort?

  • Quality of Metadata: Around 11% of packages didn’t have a description.
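
The numbers above came from a quick pandas pass over the package metadata, along the lines of the following sketch (`pypi_packages.csv` and the column names are hypothetical placeholders):

```python
# Quick EDA sketch over the package metadata with pandas.
# The CSV file and column names here are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("pypi_packages.csv")

print("skewness:", df["num_releases"].skew())          # was > 30 in our data
print(df["num_releases"].quantile([0.25, 0.5, 0.75]))  # Q1 sat at just 2 releases

missing_desc = df["description"].isna().mean()
print(f"packages without a description: {missing_desc:.1%}")  # ~11%
```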

[Figure: Impact of Filters on Dataset Size]

To ensure we gathered quality data, we applied a filter based on these findings; the figure above shows its impact on dataset size.

After filtering, we had 217,916 packages to work with. Of these, data for 199,499 packages was fetched successfully (the remainder failed mostly due to invalid repository links).
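
Per package, the filter boils down to a few metadata checks. Here is a hedged sketch driven by PyPI’s JSON API; the release threshold and the repository-URL heuristic are illustrative assumptions, not our production values.

```python
# Sketch of a per-package filter, driven by PyPI's JSON API.
# The release threshold and URL heuristic are illustrative assumptions.
import requests

def passes_filter(package: str, min_releases: int = 3) -> bool:
    resp = requests.get(f"https://pypi.org/pypi/{package}/json")
    if resp.status_code != 200:
        return False  # package metadata unavailable
    meta = resp.json()

    has_description = bool((meta["info"].get("description") or "").strip())
    release_count = len(meta.get("releases", {}))
    project_urls = meta["info"].get("project_urls") or {}
    has_repo_link = any(host in (url or "")
                        for url in project_urls.values()
                        for host in ("github.com", "gitlab.com"))

    return has_description and release_count >= min_releases and has_repo_link
```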

Uncovering the Dataset’s Cry for Help

The numbers weren’t just numbers; they were a wake-up call from the PyPI ecosystem. A few key issues are highlighted below:

1. A Lot of Zero Values:

[Figure: Stars Distribution (Skew: 29.31504)]

Many repositories had zero activity in crucial areas, making it hard to assess their true health.

2. Highly Skewed Distributions:

[Figure: Skewness of Numeric Features]


The extreme skew (driven largely by those zero values) made it clear that straightforward mean-based scaling wouldn’t cut it, as we can infer from the figure above.

3. Highly Correlated Features:

[Figure: Heatmap of Feature Correlations]

Several metrics overlapped, adding layers of complexity to the analysis. A compact diagnostic covering all three issues follows.
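
Assuming the `df` frame from the EDA sketch above, a pandas pass like this surfaces all three problems at once; the 0.9 correlation cutoff is an illustrative threshold:

```python
# Diagnostic sketch for the three issues above: zero inflation, heavy skew,
# and strongly correlated feature pairs. Assumes `df` from the EDA sketch.
import numpy as np

features = df.select_dtypes("number")

zero_fraction = (features == 0).mean().sort_values(ascending=False)  # issue 1
skewness = features.skew().sort_values(ascending=False)              # issue 2

corr = features.corr().abs()                                         # issue 3
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # upper triangle, no diagonal
high_pairs = corr.where(mask).stack().loc[lambda s: s > 0.9]

print(zero_fraction.head(), skewness.head(), high_pairs, sep="\n\n")
```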

These weren’t mere statistical quirks. They were real signals that the open-source ecosystem needed a smarter way of being evaluated, a need that PSS aims to meet.

What’s Next?

In the upcoming blog, we'll share how we tackled these data quality challenges. We’ll look at the preprocessing techniques we used to deal with skewed distributions, zero values, and correlated features, setting the stage for the next step in building a robust model.

Stay tuned as we continue this journey, and thanks for coming along for the ride through the ups and downs of open-source package sustainability!
