Electronic Theses and Dissertation

Universitas Syiah Kuala

    UNDERGRADUATE THESIS (SKRIPSI)

COMPARATIVE ANALYSIS OF GAMMATONE-FREQUENCY CEPSTRAL COEFFICIENTS AND CONSTANT-Q CEPSTRAL COEFFICIENTS FEATURE EXTRACTION IN AUDIO DEEPFAKE DETECTION


Author

Furqan Al Ghifari Zulva

Supervisors

Taufik Fuadi Abidin - 197010081994031002 - Supervisor I
Rasudin - 197410011999031001 - Supervisor II



Student ID Number (NPM)

2108107010053

Faculty & Study Program

Faculty of Mathematics and Natural Sciences (MIPA) / Informatics (S1) / PDDIKTI: 55201

Publisher

Banda Aceh: Faculty of Mathematics and Natural Sciences (MIPA), Informatics, 2026


Audio deepfake detection has become an important challenge in digital security as deep-learning-based voice synthesis grows increasingly capable. The technology makes it possible to generate synthetic speech that closely resembles a specific person's real voice, so it can be misused for fraud, information manipulation, and privacy violations. This study analyzes and compares the effectiveness of two cepstral feature extraction methods, Gammatone-Frequency Cepstral Coefficients (GFCC) and Constant-Q Cepstral Coefficients (CQCC), as well as their combination, for detecting deepfake audio with a Convolutional Neural Network–Bidirectional Long Short-Term Memory (CNN–BiLSTM) architecture. The data come from the Fake or Real (FoR) dataset, which contains more than 198,000 audio samples in four variants: original, normalised, 2-seconds, and rerecorded. The CNN–BiLSTM model was trained with the Adam optimizer, a cross-entropy loss function, a batch size of 32, and 50 epochs. Evaluation used accuracy, precision, recall, F1-score, Area Under Curve (AUC), and Equal Error Rate (EER). The experiments show that the model with the combined GFCC–CQCC features achieved the best performance, with 98.50% accuracy, 98.30% F1-score, 98.73% recall, 0.9990 AUC, and 1.51% EER. The feature combination outperforms either feature alone because GFCC captures the natural characteristics of the vocal tract well, while CQCC is effective at detecting the nonlinear frequency artifacts typical of synthetic audio. In addition, the CNN–BiLSTM model was implemented in a client–server system with a FastAPI backend and a Flutter mobile frontend. The study demonstrates the potential of integrating dual cepstral features with a hybrid CNN–BiLSTM architecture to strengthen voice-based security systems against digital manipulation threats.

Audio deepfake detection has become a crucial challenge in digital security as advances in deep learning enable highly realistic synthetic voice generation. This technology allows the creation of speech that closely resembles a real person’s voice, posing serious risks of misuse in fraud, misinformation, and privacy violations. This study aims to analyze and compare the effectiveness of two cepstral-based feature extraction methods, namely Gammatone Frequency Cepstral Coefficients (GFCC) and Constant-Q Cepstral Coefficients (CQCC), as well as their combination, in detecting audio deepfakes using a Convolutional Neural Network–Bidirectional Long Short-Term Memory (CNN–BiLSTM) architecture. The experiment utilizes the Fake or Real (FoR) dataset, consisting of more than 198,000 audio samples across four variants (original, normalized, 2-seconds, and rerecorded). The CNN–BiLSTM model was trained using the Adam optimizer, cross-entropy loss function, a batch size of 32, and 50 epochs. Model performance was evaluated using accuracy, precision, recall, F1-score, Area Under Curve (AUC), and Equal Error Rate (EER) metrics. Experimental results show that the combined GFCC–CQCC feature model achieved the best performance with an accuracy of 98.50%, F1-score of 98.30%, recall of 98.73%, AUC of 0.9990, and EER of 1.51%. The combination of features enhanced performance compared to individual features, as GFCC effectively captures natural vocal tract characteristics, while CQCC is more sensitive to nonlinear spectral artifacts typical of synthetic speech. Furthermore, the trained CNN–BiLSTM model was integrated into a client–server system using FastAPI for the backend and Flutter for the mobile frontend, enabling real-time voice authenticity detection. This research demonstrates the potential of combining cepstral-based features with a hybrid CNN–BiLSTM architecture to strengthen voice-based security systems against digital manipulation threats.
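As a rough illustration of the GFCC–CQCC front end described in the abstracts, the sketch below computes CQCC-style features with librosa's constant-Q transform followed by a log and a DCT, then concatenates them frame by frame with a GFCC matrix. It is a minimal sketch, not the thesis implementation: the uniform-resampling step of the full CQCC algorithm is omitted, extract_gfcc is a hypothetical placeholder for a gammatone-filterbank front end, and the sampling rate, hop length, number of coefficients, and the file name sample.wav are assumed values.

```python
"""Minimal sketch of CQCC-style extraction and GFCC-CQCC fusion.

Assumptions (not taken from the thesis): 16 kHz audio, 20 cepstral
coefficients, hop length of 256 samples. The uniform-resampling step of
the original CQCC algorithm is omitted, and extract_gfcc() is only a
stand-in for a gammatone-filterbank front end.
"""
import numpy as np
import librosa
from scipy.fftpack import dct


def extract_cqcc(y, sr=16000, n_ceps=20, hop_length=256):
    # Constant-Q power spectrogram (7 octaves above ~32.7 Hz by default).
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                           n_bins=84, bins_per_octave=12)) ** 2
    log_C = librosa.power_to_db(C)                    # log power spectrum
    ceps = dct(log_C, type=2, axis=0, norm="ortho")   # DCT along frequency
    return ceps[:n_ceps, :]                           # (n_ceps, n_frames)


def extract_gfcc(y, sr=16000, n_ceps=20, hop_length=256):
    # Hypothetical placeholder: a real implementation would apply a
    # gammatone (ERB-spaced) filterbank before the log + DCT steps.
    raise NotImplementedError


def fuse_features(y, sr=16000):
    cqcc = extract_cqcc(y, sr)
    gfcc = extract_gfcc(y, sr)
    n = min(cqcc.shape[1], gfcc.shape[1])             # align frame counts
    return np.concatenate([gfcc[:, :n], cqcc[:, :n]], axis=0)


if __name__ == "__main__":
    y, sr = librosa.load("sample.wav", sr=16000)      # hypothetical file
    print(extract_cqcc(y, sr).shape)
```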
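To make the training and evaluation protocol quoted above concrete (Adam, cross-entropy loss, batch size 32, 50 epochs; accuracy, precision, recall, F1, AUC, EER), here is a minimal sketch using tf.keras and scikit-learn. The layer widths, the (frames, coefficients) input shape, and the random dummy tensors are illustrative assumptions, not the exact architecture reported in the thesis.

```python
"""Sketch of a CNN-BiLSTM deepfake classifier and its evaluation.

Only the training hyperparameters (Adam, cross-entropy, batch size 32,
50 epochs) and the metric list follow the abstract; the layer sizes and
dummy data are illustrative assumptions.
"""
import numpy as np
import tensorflow as tf
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score, roc_curve)

N_FRAMES, N_FEATS = 200, 40   # e.g. 20 GFCC + 20 CQCC coefficients per frame


def build_model():
    inp = tf.keras.Input(shape=(N_FRAMES, N_FEATS, 1))
    x = tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inp)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    x = tf.keras.layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    # Collapse the frequency/channel axes so the BiLSTM sees a time sequence.
    x = tf.keras.layers.Reshape((N_FRAMES // 4, (N_FEATS // 4) * 64))(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # fake vs. real
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


def equal_error_rate(y_true, scores):
    # EER: the operating point where false-acceptance and false-rejection meet.
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0


if __name__ == "__main__":
    # Dummy tensors stand in for the extracted GFCC-CQCC feature maps.
    X_train = np.random.rand(256, N_FRAMES, N_FEATS, 1).astype("float32")
    y_train = np.random.randint(0, 2, 256)
    X_test = np.random.rand(64, N_FRAMES, N_FEATS, 1).astype("float32")
    y_test = np.random.randint(0, 2, 64)

    model = build_model()
    model.fit(X_train, y_train, batch_size=32, epochs=50, validation_split=0.1)

    scores = model.predict(X_test).ravel()
    preds = (scores >= 0.5).astype(int)
    prec, rec, f1, _ = precision_recall_fscore_support(y_test, preds,
                                                       average="binary")
    print("accuracy:", accuracy_score(y_test, preds))
    print("precision/recall/F1:", prec, rec, f1)
    print("AUC:", roc_auc_score(y_test, scores))
    print("EER:", equal_error_rate(y_test, scores))
```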
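The abstracts also state that the trained model is served through a FastAPI backend to a Flutter mobile client. A hedged sketch of such an endpoint follows; the /predict route, the model file name, and the extract_features helper are hypothetical stand-ins, since the record does not describe the actual API of the thesis system.

```python
"""Hypothetical FastAPI inference endpoint for the deepfake detector.

The /predict route, model path, and helper names are illustrative; the
abstract only states that FastAPI serves the trained CNN-BiLSTM model
to a Flutter mobile client.
"""
import io

import librosa
import numpy as np
import tensorflow as tf
from fastapi import FastAPI, File, UploadFile

app = FastAPI(title="Audio Deepfake Detection API")
model = tf.keras.models.load_model("cnn_bilstm_gfcc_cqcc.h5")  # assumed path


def extract_features(y: np.ndarray, sr: int) -> np.ndarray:
    """Placeholder for the GFCC-CQCC front end (see the feature sketch above)."""
    raise NotImplementedError


@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Decode the uploaded audio, extract features, and score it with the model.
    raw = await file.read()
    y, sr = librosa.load(io.BytesIO(raw), sr=16000)
    feats = extract_features(y, sr)[np.newaxis, ..., np.newaxis]  # batch/channel dims
    score = float(model.predict(feats).ravel()[0])
    return {"filename": file.filename,
            "fake_probability": score,
            "label": "fake" if score >= 0.5 else "real"}
```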
