How to Load Tabular Data into PyTorch Environments
PyTorch is an open-source machine learning framework originally developed by Meta Platforms (formerly Facebook) in its AI Research lab. It has become a preferred choice among academic researchers for rapid prototyping, and it is increasingly used for production-scale AI applications. Beyond its broad set of tools for AI development, PyTorch can speed up computation by running the main workload on GPU hardware.
PyTorch's primary data structure is the tensor, a multidimensional array. Data must be converted into tensors before it can be fed into a PyTorch model.
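As a minimal sketch of what a tensor looks like in practice, the snippet below builds one from a NumPy array and moves it to a GPU when one is available. The array values are made up purely for illustration.

```python
import numpy as np
import torch

# Hypothetical 2x3 array of feature values (illustrative only).
array = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# Convert to a float32 tensor, the default floating-point dtype in PyTorch.
tensor = torch.tensor(array, dtype=torch.float32)

# Use the GPU when available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor = tensor.to(device)

print(tensor.shape)  # torch.Size([2, 3])
print(tensor.dtype)  # torch.float32
```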
In this tutorial, we demonstrate how to load tabular datasets into PyTorch tensors. The motivation is that, although many tutorials cover loading image datasets into PyTorch, very few cover how to process tabular data.
For practice, we use CICIoV2024, an open cybersecurity dataset that can be downloaded from Kaggle.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from torch.optim import Adam
from torchsummary import summary
from sklearn.model_selection import train_test_split
Code 1. Importing the libraries used in this tutorial.
df_train = pd.read_csv("train_dataset.csv")
df_val = pd.read_csv("validation_dataset.csv")
df_test = pd.read_csv("test_dataset.csv")
Code 2. Loading the CSV files into Pandas DataFrames.
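Because the later steps slice columns by position (the first 8 columns as features, the rest as labels), it may help to confirm the layout right after loading. The sketch below uses a small in-memory CSV with hypothetical column names as a stand-in for the real files. The actual dataset's column names and row count will differ.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file: 8 feature columns followed by 1 label column.
csv_text = """f0,f1,f2,f3,f4,f5,f6,f7,label
1,2,3,4,5,6,7,8,0
2,3,4,5,6,7,8,9,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Inspect the shape and column types before slicing by position.
print(df.shape)
print(df.dtypes)

# Count missing values across the whole frame.
n_missing = int(df.isna().sum().sum())

# Positional split used throughout the tutorial: first 8 columns vs. the rest.
features = df.iloc[:, :8]
labels = df.iloc[:, 8:]
```

If the real files contain missing values or non-numeric feature columns, those would need handling before the normalization step.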
# Compute the statistics from the training set only (the first 8 columns are the features).
mean = df_train.iloc[:, :8].mean().to_numpy()
std = df_train.iloc[:, :8].std().to_numpy()

# Normalization (z-score) of the first 8 columns
def normalization(data, std, mean):
    # Work on a copy so the caller's DataFrame is not modified in place.
    data = data.copy()
    feature_columns = data.columns[:8]
    for i, target_column in enumerate(feature_columns):
        data[target_column] = (data[target_column] - mean[i]) / std[i]
    return data

# The validation and test sets reuse the training statistics.
df_train = normalization(df_train, std, mean)
df_val = normalization(df_val, std, mean)
df_test = normalization(df_test, std, mean)
Code 3. Normalizing the tabular data loaded into Pandas DataFrames.
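The same z-score normalization can also be written without a loop by relying on Pandas column alignment. The following is a sketch on made-up data, not the tutorial's dataset. The `normalize` helper, the column names, and the choice to replace a zero standard deviation with 1 are all assumptions added here for illustration.

```python
import pandas as pd

# Hypothetical training and validation frames: 2 features plus a label column.
train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0], "label": [0, 1, 0]})
val = pd.DataFrame({"a": [2.0, 4.0], "b": [5.0, 6.0], "label": [1, 0]})
feature_columns = ["a", "b"]

# Statistics come from the training split only.
mean = train[feature_columns].mean()
std = train[feature_columns].std()

# Assumption: replace a zero standard deviation with 1 so that
# constant columns do not cause a division by zero.
std = std.replace(0.0, 1.0)

def normalize(frame, mean, std, columns):
    """Return a copy of frame with the given columns z-score normalized."""
    out = frame.copy()
    out[columns] = (out[columns] - mean) / std
    return out

train_norm = normalize(train, mean, std, feature_columns)
val_norm = normalize(val, mean, std, feature_columns)

print(train_norm)
print(val_norm)
```

Subtracting a Series from a DataFrame aligns on column names, so each column is scaled by its own mean and standard deviation while the label column is left untouched.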
# Use the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Split each DataFrame into features (first 8 columns) and labels (remaining columns).
x_train = df_train.iloc[:, :8]
y_train = df_train.iloc[:, 8:]
x_val = df_val.iloc[:, :8]
y_val = df_val.iloc[:, 8:]
x_test = df_test.iloc[:, :8]
y_test = df_test.iloc[:, 8:]

class dataset(Dataset):
    def __init__(self, x, y):
        # Convert the DataFrames to float32 tensors and move them to the selected device.
        self.x = torch.tensor(x.values, dtype=torch.float32).to(device)
        self.y = torch.tensor(y.values, dtype=torch.float32).to(device)

    def __len__(self):
        # Number of samples (rows).
        return len(self.x)

    def __getitem__(self, index):
        # Return one (features, label) pair.
        return self.x[index], self.y[index]

training_data = dataset(x_train, y_train)
validation_data = dataset(x_val, y_val)
testing_data = dataset(x_test, y_test)

# Only the training loader uses mini-batches; the others yield one sample at a time.
train_dataloader = DataLoader(training_data, batch_size=8, shuffle=True)
train_evaluation_dataloader = DataLoader(training_data, batch_size=1, shuffle=True)
validation_dataloader = DataLoader(validation_data, batch_size=1, shuffle=False)
test_dataloader = DataLoader(testing_data, batch_size=1, shuffle=False)
Code 4. Transforming the tabular data into PyTorch tensors and wrapping them in DataLoaders.
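To check that a DataLoader yields batches of the expected shape, one can iterate over it once. The sketch below uses random data in place of the real dataset. It also uses `TensorDataset`, a built-in PyTorch class that can stand in for a hand-written Dataset when the data is already held in tensors. The sample count of 20 is an arbitrary assumption.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: 20 samples, 8 features, 1 label column (random values).
x = torch.randn(20, 8)
y = torch.randint(0, 2, (20, 1)).float()

# Wrap the tensors and serve them in shuffled mini-batches of 8.
data = TensorDataset(x, y)
loader = DataLoader(data, batch_size=8, shuffle=True)

# Record the shape of every batch the loader produces.
batch_shapes = []
for features, labels in loader:
    batch_shapes.append((tuple(features.shape), tuple(labels.shape)))

print(batch_shapes)
```

With 20 samples and a batch size of 8, the loader produces three batches, and the last one holds the remaining 4 samples. Passing `drop_last=True` to `DataLoader` would discard that smaller final batch.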
Author: Yulianto, S.Kom., M.Kom.