Data Sampling in Machine Learning

Chanchala Gorale
2 min read · Jun 11, 2024


Data Sampling is the process of selecting a subset of data from a larger dataset. This can be useful for various reasons, including reducing the size of the dataset to make analysis more manageable, balancing class distributions in imbalanced datasets, or preparing data for training and testing in machine learning models.
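As a rough illustration of selecting a subset, here is a minimal sketch using Pandas' DataFrame.sample. The dataset, its column names, and the 10% and 20% fractions are made up for the example and are not from any real project.

```python
import pandas as pd

# Hypothetical dataset: 1,000 rows with a numeric feature and a binary label
full = pd.DataFrame({
    "feature": range(1000),
    "label": [i % 2 for i in range(1000)],
})

# Draw a 10% random subset without replacement; random_state makes it repeatable
subset = full.sample(frac=0.1, random_state=42)

# Hold out 20% for testing: sample the test rows, then drop them from the rest
test = full.sample(frac=0.2, random_state=42)
train = full.drop(test.index)

print(len(subset), len(train), len(test))  # 100 800 200
```

Dropping the sampled rows by index is one simple way to keep the training and test sets from overlapping; dedicated utilities for splitting exist in other libraries, but this keeps the sketch to Pandas only.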

Upsampling and Downsampling

Upsampling and Downsampling are techniques that change how many data points a dataset contains. They are commonly used in time series analysis, where they change the sampling frequency, and in handling imbalanced datasets, where they change the number of examples per class.

  • Upsampling: This process increases the number of samples by adding new data points. In a time series, new points are created between existing ones, for example by interpolation. In an imbalanced dataset, minority-class examples are duplicated or synthesized.
  • Downsampling: This process reduces the number of samples. In a time series, data points are aggregated or averaged over a specified interval. In an imbalanced dataset, majority-class examples are removed.
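For the imbalanced-dataset case, the idea can be sketched with plain Pandas. This is only one simple approach (random sampling of rows), and the 90/10 dataset and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 90 rows of class 0, 10 rows of class 1
data = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 90 + [1] * 10,
})
majority = data[data["label"] == 0]
minority = data[data["label"] == 1]

# Upsample: draw minority rows with replacement until they match the majority
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)
balanced_up = pd.concat([majority, minority_up])

# Downsample: keep a random subset of majority rows the size of the minority
majority_down = majority.sample(n=len(minority), random_state=0)
balanced_down = pd.concat([majority_down, minority])

print(balanced_up["label"].value_counts())
print(balanced_down["label"].value_counts())
```

Random upsampling only repeats existing rows, so it adds no new information. Random downsampling discards data. Which trade-off is acceptable depends on the dataset.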

Example of Resampling Functions in Python (Using Pandas)

Here are examples of how to perform upsampling and downsampling on time series data using the resample method in Python's Pandas library.

Setup: Create a Sample Time Series DataFrame

import pandas as pd
import numpy as np

# Create a date range
date_rng = pd.date_range(start='2021-01-01', end='2021-01-10', freq='D')

# Create a sample DataFrame (seeded so the random values are reproducible)
np.random.seed(42)
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))

# Set the date as the index
df.set_index('date', inplace=True)
print(df)

Downsampling

Downsampling lowers the frequency, for example from daily data to one value every 2 days. Here each 2-day bin is replaced by the mean of its values:

# Downsample to 2-day frequency
df_downsampled = df.resample('2D').mean()
print("Downsampled DataFrame:")
print(df_downsampled)
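The mean is only one way to summarise each bin. As a small sketch, the series below uses fixed values 0 to 9 (an assumption made so the result is easy to check by hand) and computes several aggregates per 2-day bin in one call.

```python
import pandas as pd

# Hypothetical deterministic series: values 0..9 on ten consecutive days
idx = pd.date_range(start="2021-01-01", periods=10, freq="D")
s = pd.Series(range(10), index=idx, name="data")

# Each 2-day bin covers two daily values; summarise it several ways at once
summary = s.resample("2D").agg(["mean", "sum", "max"])
print(summary)
```

The first bin holds the values 0 and 1, so its mean is 0.5, its sum is 1, and its max is 1. Ten daily rows collapse to five 2-day rows.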

Upsampling

Upsampling raises the frequency, for example from daily data to hourly data. The new hourly rows start out as NaN and are then filled in by linear interpolation:

# Upsample to hourly frequency ('h' is the current alias; 'H' is deprecated since pandas 2.2)
df_upsampled = df.resample('h').asfreq()

# Interpolate the missing values
df_upsampled['data'] = df_upsampled['data'].interpolate(method='linear')
print("Upsampled DataFrame:")
print(df_upsampled)
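Interpolation is not the only way to fill the gaps that upsampling creates. The sketch below compares linear interpolation with forward fill on a tiny series whose values (0, 24, 48) were chosen for the example so that the interpolated value at each hour equals the number of hours elapsed.

```python
import pandas as pd

# Hypothetical series: three daily readings that rise by 24 per day
idx = pd.date_range(start="2021-01-01", periods=3, freq="D")
s = pd.Series([0.0, 24.0, 48.0], index=idx, name="data")

# Linear interpolation draws a straight line between known points
hourly_interp = s.resample("h").asfreq().interpolate(method="linear")

# Forward fill repeats the last known value until the next reading
hourly_ffill = s.resample("h").ffill()

noon = pd.Timestamp("2021-01-01 12:00")
print(hourly_interp.loc[noon], hourly_ffill.loc[noon])
```

At noon on the first day, interpolation gives a value halfway between the two readings, while forward fill still reports the midnight reading. Forward fill may suit values that stay constant between readings. Interpolation may suit values that change gradually.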

Summary

  • Data Sampling: Selecting a subset of data from a larger dataset.
  • Upsampling: Increasing the number of samples by adding new data points.
  • Downsampling: Reducing the number of samples by removing or aggregating data points.

By using the resample method in Pandas, you can easily upsample and downsample your time series data to suit your analysis needs.
