How to Load Tabular Data into PyTorch Environments


PyTorch is an open-source machine learning framework originally developed by Meta Platforms (formerly Facebook) at its AI Research lab. It has become a preferred choice among academic researchers for rapid prototyping and is increasingly used for production-scale AI applications. Besides providing many tools for AI development, PyTorch can speed up computation by running it on GPU hardware.

PyTorch has its own primary data structure, the tensor, which is a multidimensional array. Data must be converted into tensors before it can be fed into a PyTorch model.
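As a minimal sketch of what this conversion looks like, the snippet below builds a small made-up pandas DataFrame and turns it into a float32 tensor. The column names and values are purely illustrative and are not taken from the dataset used in this tutorial.

```python
import pandas as pd
import torch

# Hypothetical toy table; column names and values are illustrative only.
toy_df = pd.DataFrame({"feature_a": [1, 2, 3], "feature_b": [0.5, 1.5, 2.5]})

# DataFrame -> NumPy array -> tensor, cast to float32 for use in a model.
toy_tensor = torch.tensor(toy_df.to_numpy(), dtype=torch.float32)

print(toy_tensor.shape)  # torch.Size([3, 2]): 3 rows, 2 columns
print(toy_tensor.dtype)  # torch.float32

# Move the tensor to the GPU when one is available, otherwise keep it on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toy_tensor = toy_tensor.to(device)
```

Each row of the table becomes one row of the tensor, and each column becomes one feature dimension.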

In this tutorial, we demonstrate how to load tabular datasets into PyTorch tensors. The motivation is that, although many tutorials discuss how to load image datasets into PyTorch, very few discuss how to process tabular data.

For simple practice, we will use an open cybersecurity dataset called CICIoV2024, which can be downloaded from Kaggle.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from torch.optim import Adam
from torchsummary import summary
from sklearn.model_selection import train_test_split

Code 1. Importing the libraries used in this tutorial.

df_train = pd.read_csv("train_dataset.csv")
df_val = pd.read_csv("validation_dataset.csv")
df_test = pd.read_csv("test_dataset.csv")

Code 2. Loading the CSV files into pandas DataFrames.
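Before going further, it can help to confirm that the three splits share the same columns and contain no missing values. The following is only a sketch: it reads tiny in-memory CSV strings with made-up column names (f0, f1, label) instead of the real files, so that it runs on its own.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the three CSV files; the column names are made up.
train_csv = io.StringIO("f0,f1,label\n1,2,0\n3,4,1\n")
val_csv = io.StringIO("f0,f1,label\n5,6,0\n")
test_csv = io.StringIO("f0,f1,label\n7,8,1\n")

splits = {
    "train": pd.read_csv(train_csv),
    "validation": pd.read_csv(val_csv),
    "test": pd.read_csv(test_csv),
}

# Compare every split against the training columns and count missing cells.
reference_columns = list(splits["train"].columns)
for name, frame in splits.items():
    same_columns = list(frame.columns) == reference_columns
    missing_values = int(frame.isna().sum().sum())
    print(name, frame.shape, same_columns, missing_values)
```

With the real files, the same loop could be run over df_train, df_val, and df_test.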

# Compute the statistics on the training features only (the first 8 columns),
# then reuse them for the validation and test sets.
feature_columns = df_train.columns[:8]
mean = df_train[feature_columns].mean().to_numpy()
std = df_train[feature_columns].std().to_numpy()

# Normalization: standardize each feature column as (x - mean) / std
def normalization(data, std, mean):
    data = data.copy()  # leave the caller's DataFrame untouched
    data[feature_columns] = (data[feature_columns] - mean) / std
    return data

df_train = normalization(df_train, std, mean)
df_val = normalization(df_val, std, mean)
df_test = normalization(df_test, std, mean)

Code 3. Normalizing the tabular data loaded into pandas DataFrames.
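One way to check that the normalization behaves as intended is to confirm that every training feature ends up with a mean near 0 and a standard deviation near 1. The sketch below does this on synthetic data with 8 hypothetical feature columns and one hypothetical label column, applying the same (x - mean) / std formula in vectorized form. The names toy_train, toy_mean, and toy_std are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: 8 feature columns (f0..f7) plus 1 label column.
features = rng.normal(loc=50.0, scale=10.0, size=(1000, 8))
toy_train = pd.DataFrame(features, columns=[f"f{i}" for i in range(8)])
toy_train["label"] = rng.integers(0, 2, size=1000)

# Statistics come from the feature columns only; the label is left alone.
feature_columns = toy_train.columns[:8]
toy_mean = toy_train[feature_columns].mean()
toy_std = toy_train[feature_columns].std()

toy_normalized = toy_train.copy()
toy_normalized[feature_columns] = (toy_train[feature_columns] - toy_mean) / toy_std

# After standardization, each feature should have mean ~0 and std ~1.
print(toy_normalized[feature_columns].mean().abs().max())
print(toy_normalized[feature_columns].std().max())
```

The validation and test sets are not expected to land exactly on 0 and 1, because they are scaled with the training statistics rather than their own.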

# Use the GPU when one is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Split each DataFrame into features (first 8 columns) and labels (remaining columns)
x_train = df_train.iloc[:,:8]
y_train = df_train.iloc[:,8:]
x_val = df_val.iloc[:,:8]
y_val = df_val.iloc[:,8:]
x_test = df_test.iloc[:,:8]
y_test = df_test.iloc[:,8:]

# Wrap a feature/label pair as a PyTorch Dataset of float32 tensors
class dataset(Dataset):
    def __init__(self, x, y):
        self.x = torch.tensor(x.values, dtype = torch.float32).to(device)
        self.y = torch.tensor(y.values, dtype = torch.float32).to(device)

    def __len__(self):
        # Number of rows (samples)
        return len(self.x)

    def __getitem__(self, index):
        # One sample: its feature vector and its label vector
        return self.x[index], self.y[index]

training_data = dataset(x_train, y_train)
validation_data = dataset(x_val, y_val)
testing_data = dataset(x_test, y_test)

# Shuffled mini-batches for training; one sample per batch for evaluation
train_dataloader = DataLoader(training_data, batch_size = 8, shuffle = True)
train_evaluation_dataloader = DataLoader(training_data, batch_size = 1, shuffle = True)
validation_dataloader = DataLoader(validation_data, batch_size = 1, shuffle = False)
test_dataloader = DataLoader(testing_data, batch_size = 1, shuffle = False)

Code 4. Converting the tabular data into PyTorch tensors and wrapping them in DataLoaders.
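To see what a DataLoader actually yields, the following sketch iterates over one built from random toy tensors and records the shape of each batch. The class name ToyTabularDataset, the 20-row size, and the single label column are assumptions made for illustration. The real dataset may have a different number of rows and label columns.

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical minimal Dataset holding pre-built feature and label tensors.
class ToyTabularDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        return self.x[index], self.y[index]

toy_x = torch.randn(20, 8)                    # 20 rows, 8 features
toy_y = torch.randint(0, 2, (20, 1)).float()  # 20 rows, 1 label column (assumed)
toy_loader = DataLoader(ToyTabularDataset(toy_x, toy_y), batch_size=8, shuffle=True)

# Collect the (features, labels) shape of every batch in one pass.
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in toy_loader]
print(batch_shapes)
# [((8, 8), (8, 1)), ((8, 8), (8, 1)), ((4, 8), (4, 1))]
```

Because 20 is not a multiple of 8, the last batch holds only the 4 remaining rows. Passing drop_last=True to DataLoader would discard that partial batch instead.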

Author: Yulianto, S.Kom., M.Kom.