PyTorch is an open-source machine learning framework originally developed by Meta Platforms (formerly Facebook) in its AI Research lab. It has become the preferred framework among academic researchers for rapid prototyping and is increasingly used for production-scale AI applications. Besides providing a broad set of tools for AI development, PyTorch can speed up computation by running the main numerical work on GPU hardware.

PyTorch has its own primary data structure, the tensor, which is a multidimensional array. Data must be converted into this format before it can be fed into a PyTorch model.
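
As a minimal sketch of what a tensor looks like in practice, the snippet below builds a 2-D tensor from a small NumPy array and moves it to a GPU when one is available. The array values and variable names are illustrative assumptions and are not part of the dataset used later.

```python
import numpy as np
import torch

# Illustrative 2 x 3 array; the values are made up for this sketch.
array = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype=np.float32)

# torch.from_numpy keeps the NumPy dtype and shares memory with the array.
tensor = torch.from_numpy(array)

# Use the GPU if PyTorch can see one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor = tensor.to(device)

print(tensor.shape)  # torch.Size([2, 3])
print(tensor.dtype)  # torch.float32
```

The same `.to(device)` call is what later places the tabular data on the GPU.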

In this tutorial, we demonstrate how to load a tabular dataset into PyTorch tensors. Many tutorials explain how to load image datasets into PyTorch, but very few cover tabular data, which is the motivation for this one.

For practice, we use CICIoV2024, an open cybersecurity dataset that can be downloaded from Kaggle.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.optim import Adam

from torchsummary import summary

Code 1. Importing the libraries used in this tutorial.

df_train = pd.read_csv("train_dataset.csv")
df_val = pd.read_csv("validation_dataset.csv")
df_test = pd.read_csv("test_dataset.csv")

Code 2. Loading the CSV files into Pandas DataFrames.
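
Before going further, it can help to confirm that a loaded frame has the layout the rest of this tutorial relies on: eight numeric feature columns first, followed by the label columns. The sketch below uses a tiny in-memory CSV with hypothetical column names (f0 to f7, label_a, label_b) in place of the real files, so the names and values are assumptions made only for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for train_dataset.csv; column names and values are hypothetical.
csv_text = (
    "f0,f1,f2,f3,f4,f5,f6,f7,label_a,label_b\n"
    "1,2,3,4,5,6,7,8,1,0\n"
    "2,3,4,5,6,7,8,9,0,1\n"
)
df_example = pd.read_csv(io.StringIO(csv_text))

features = df_example.iloc[:, :8]  # first eight columns
labels = df_example.iloc[:, 8:]    # every column after them

print(features.shape)  # (2, 8)
print(labels.shape)    # (2, 2)

# Check that every feature column has a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in features.dtypes)
print(all_numeric)
```

If the real files contain extra columns before the features, the `iloc` slices used in the following steps would need to be adjusted accordingly.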

# Compute the statistics from the training set only
mean = df_train.iloc[:, :8].mean().to_numpy()
std = df_train.iloc[:, :8].std().to_numpy()

# Normalization
def normalization(data, std, mean):
    # Number of feature columns (the first 8 columns of the frame)
    _, number_of_columns = data.iloc[:, :8].shape

    for i in range(0, number_of_columns):
        target_column = data.columns[i]

        # Standardize the column with the training mean and standard deviation
        data[target_column] = (data[target_column] - mean[i]) / std[i]

    return data

# The same training statistics are applied to all three splits
df_train = normalization(df_train, std, mean)
df_val = normalization(df_val, std, mean)
df_test = normalization(df_test, std, mean)

Code 3. Normalizing the tabular data loaded into Pandas.
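
To make the normalization step concrete, here is a small worked sketch on a single made-up feature column. It assumes the same approach as above: the mean and standard deviation come from the training data only and are then reused for the other splits. The frames, values, and names (`train_example`, `val_example`, `f0`) are hypothetical.

```python
import pandas as pd

# Made-up single-feature frames standing in for the training and validation sets.
train_example = pd.DataFrame({"f0": [1.0, 2.0, 3.0]})
val_example = pd.DataFrame({"f0": [2.0, 4.0]})

# Statistics come from the training data only.
# pandas computes the sample standard deviation (ddof=1) by default.
f0_mean = train_example["f0"].mean()  # 2.0
f0_std = train_example["f0"].std()    # 1.0

train_norm = (train_example["f0"] - f0_mean) / f0_std
val_norm = (val_example["f0"] - f0_mean) / f0_std

print(train_norm.tolist())  # [-1.0, 0.0, 1.0]
print(val_norm.tolist())    # [0.0, 2.0]
```

Note that the validation values are not centered on zero. That is expected, because they are scaled with the training statistics rather than their own.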

 

# Run on the GPU when one is available, otherwise on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Split each frame into features (first 8 columns) and labels (remaining columns)
x_train = df_train.iloc[:, :8]
y_train = df_train.iloc[:, 8:]

x_val = df_val.iloc[:, :8]
y_val = df_val.iloc[:, 8:]

x_test = df_test.iloc[:, :8]
y_test = df_test.iloc[:, 8:]

class dataset(Dataset):
    def __init__(self, x, y):
        # Convert the Pandas frames to float32 tensors on the chosen device
        self.x = torch.tensor(x.values, dtype=torch.float32).to(device)
        self.y = torch.tensor(y.values, dtype=torch.float32).to(device)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        return self.x[index], self.y[index]

training_data = dataset(x_train, y_train)
validation_data = dataset(x_val, y_val)
testing_data = dataset(x_test, y_test)

# Mini-batches of 8 shuffled samples for training
train_dataloader = DataLoader(training_data, batch_size=8, shuffle=True)

# One sample per batch for evaluation
train_evaluation_dataloader = DataLoader(training_data, batch_size=1, shuffle=True)
validation_dataloader = DataLoader(validation_data, batch_size=1, shuffle=False)
test_dataloader = DataLoader(testing_data, batch_size=1, shuffle=False)

Code 4. Converting the tabular data into PyTorch tensors and wrapping them in DataLoaders.
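
As a rough illustration of what a DataLoader yields, the sketch below iterates over random stand-in data and records the shape of each batch. It uses PyTorch's built-in `TensorDataset` instead of the custom class, and the sizes (20 rows, 8 features, 2 label columns) are assumptions chosen only for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data: 20 rows, 8 features, 2 label columns (assumed sizes).
toy_features = torch.randn(20, 8)
toy_labels = torch.zeros(20, 2)
toy_labels[:, 0] = 1.0  # every row gets the same made-up one-hot label

# TensorDataset pairs each feature row with its label row.
toy_data = TensorDataset(toy_features, toy_labels)
toy_loader = DataLoader(toy_data, batch_size=8, shuffle=True)

batch_shapes = []
for x_batch, y_batch in toy_loader:
    batch_shapes.append((tuple(x_batch.shape), tuple(y_batch.shape)))

print(batch_shapes)
# [((8, 8), (8, 2)), ((8, 8), (8, 2)), ((4, 8), (4, 2))]
```

The last batch holds only 4 rows because 20 is not a multiple of 8. Passing `drop_last=True` to `DataLoader` would discard that partial batch if a fixed batch size is required.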

 

Author: Yulianto, S.Kom., M.Kom.