publications
Link to my Google Scholar page
- Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries
  Serkan Sulun, Paula Viana, and Matthew E. P. Davies
  In revision, Feb 2025
We introduce EMSYNC, a video-based symbolic music generation model that aligns music with a video’s emotional content and temporal boundaries. It follows a two-stage framework, where a pretrained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate and align musical chords with scene cuts. Unlike existing models, our approach retains event-based encoding, ensuring fine-grained timing control and expressive musical nuances. We also propose a mapping scheme to bridge the video emotion classifier, which produces discrete emotion categories, with the emotion-conditioned MIDI generator, which operates on continuous-valued valence-arousal inputs. In subjective listening tests, EMSYNC outperforms state-of-the-art models across all subjective metrics, both for music-theory-aware participants and for general listeners. A minimal sketch of the boundary-offset idea follows the BibTeX entry below.
@article{soundtrack, title = {Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries}, author = {Sulun, Serkan and Viana, Paula and Davies, Matthew E. P.}, year = {2025}, month = feb, number = {arXiv:2502.10154}, eprint = {2502.10154}, primaryclass = {cs}, publisher = {arXiv}, doi = {10.48550/arXiv.2502.10154}, urldate = {2025-02-24}, archiveprefix = {arXiv}, journal = {In revision}, }
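The boundary-offset mechanism is described above only at a high level; the following is a minimal, hypothetical sketch of how such a temporal conditioning signal could be computed for event-based MIDI tokens. The function name, the offset cap, and the quantization into bins are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch: compute a "boundary offset" for each event-based MIDI
# token, i.e. how far away the next scene cut is, so the generator can
# anticipate and align chords with cuts. Names and quantization are assumptions.
import bisect

def boundary_offsets(event_times, scene_cuts, max_offset=8.0, n_bins=32):
    """For each event time (seconds), return a quantized offset (bin index)
    to the next scene cut; times beyond max_offset saturate at the last bin."""
    offsets = []
    for t in event_times:
        i = bisect.bisect_left(scene_cuts, t)
        if i < len(scene_cuts):
            delta = min(scene_cuts[i] - t, max_offset)   # time until the next cut
        else:
            delta = max_offset                            # no cut ahead
        offsets.append(round(delta / max_offset * (n_bins - 1)))
    return offsets

# Example: events every 0.5 s, scene cuts at 2.0 s and 5.5 s
event_times = [i * 0.5 for i in range(12)]
scene_cuts = [2.0, 5.5]
print(boundary_offsets(event_times, scene_cuts))
```

In a setup like this, the quantized offsets would be embedded and combined with the event-token embeddings so the model can see upcoming cuts while generating.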
- VEMOCLAP: A Video Emotion Classification Web Application
  Serkan Sulun, Paula Viana, and Matthew E. P. Davies
  In 2024 International Symposium on Multimedia (ISM), Dec 2024
We introduce VEMOCLAP: Video EMOtion Classifier using Pretrained features, the first readily available and open-source web application that analyzes the emotional content of any user-provided video. We improve upon our previous work, which exploits open-source pretrained models that operate on video frames and audio, and then efficiently fuse the resulting pretrained features using multi-head cross-attention. Our approach increases the state-of-the-art classification accuracy on the Ekman-6 video emotion dataset by 4.3% and offers an online application for users to run our model on their own videos or on YouTube videos; a small sketch of the fusion step follows the BibTeX entry below. We invite readers to try our application at https://serkansulun.com/app
@inproceedings{vemoclap, title = {VEMOCLAP: A Video Emotion Classification Web Application}, shorttitle = {VEMOCLAP}, booktitle = {2024 International Symposium on Multimedia (ISM)}, author = {Sulun, Serkan and Viana, Paula and Davies, Matthew E. P.}, year = {2024}, month = dec, pages = {137--140}, publisher = {IEEE Computer Society}, doi = {10.1109/ISM63611.2024.00029}, urldate = {2025-05-17}, isbn = {9798331511111}, langid = {english}, }
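To make the fusion step concrete, here is a small PyTorch sketch of multi-head cross-attention over pretrained frame and audio features. The feature dimensions, which modality queries which, and the mean-pooled classifier head are assumptions for illustration, not the released VEMOCLAP code.

```python
# Illustrative sketch (not the released code): fuse pretrained video-frame and
# audio features with multi-head cross-attention, then classify the emotion.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d_video=768, d_audio=512, d_model=256, n_heads=4, n_classes=6):
        super().__init__()
        self.video_proj = nn.Linear(d_video, d_model)   # project both modalities
        self.audio_proj = nn.Linear(d_audio, d_model)   # to a shared width
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, video_feats, audio_feats):
        # video_feats: (batch, n_frames, d_video); audio_feats: (batch, n_audio, d_audio)
        q = self.video_proj(video_feats)
        kv = self.audio_proj(audio_feats)
        fused, _ = self.cross_attn(q, kv, kv)        # video queries attend to audio
        return self.classifier(fused.mean(dim=1))    # pool over time, emotion logits

model = CrossAttentionFusion()
logits = model(torch.randn(2, 16, 768), torch.randn(2, 10, 512))
print(logits.shape)  # torch.Size([2, 6])
```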
- Movie Trailer Genre Classification Using Multimodal Pretrained Features
  Serkan Sulun, Paula Viana, and Matthew E. P. Davies
  Expert Systems with Applications, Dec 2024
We introduce a novel method for movie genre classification, capitalizing on a diverse set of readily accessible pretrained models. These models extract high-level features related to visual scenery, objects, characters, text, speech, music, and audio effects. To intelligently fuse these pretrained features, we train small classifier models with low time and memory requirements. Employing a transformer model, our approach uses all video and audio frames of movie trailers without any temporal pooling, efficiently exploiting the correspondence between all elements, in contrast to the small, fixed number of frames typically used by traditional methods. Unlike current approaches, our method fuses features originating from different tasks and modalities, with different dimensionalities, different temporal lengths, and complex dependencies. Our method outperforms state-of-the-art movie genre classification models in terms of precision, recall, and mean average precision (mAP). To foster future research, we make the pretrained features for the entire MovieNet dataset, along with our genre classification code and the trained models, publicly available. A sketch of the fusion-and-classification setup follows the BibTeX entry below.
@article{trailer, title = {Movie Trailer Genre Classification Using Multimodal Pretrained Features}, author = {Sulun, Serkan and Viana, Paula and Davies, Matthew E.P.}, year = {2024}, journal = {Expert Systems with Applications}, volume = {258}, pages = {125209}, issn = {0957-4174}, doi = {10.1016/j.eswa.2024.125209}, }
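A minimal sketch of the kind of fusion described above: pretrained features with different widths and temporal lengths are projected to a shared dimension, concatenated along time without pooling, and classified by a transformer encoder. All dimensions, the number of genres, and the learned classification token are assumptions for illustration, not the published configuration.

```python
# Illustrative sketch: project pretrained features of different widths and
# lengths to one space, concatenate along time, and predict genre logits with
# a transformer encoder read out through a learned CLS token.
import torch
import torch.nn as nn

class GenreClassifier(nn.Module):
    def __init__(self, feat_dims=(768, 512, 128), d_model=256, n_genres=21):
        super().__init__()
        self.projs = nn.ModuleList([nn.Linear(d, d_model) for d in feat_dims])
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_genres)      # multi-label genre logits

    def forward(self, feats):
        # feats: list of tensors, each (batch, length_i, dim_i); lengths may differ
        tokens = [proj(f) for proj, f in zip(self.projs, feats)]
        x = torch.cat([self.cls.expand(feats[0].size(0), -1, -1)] + tokens, dim=1)
        return self.head(self.encoder(x)[:, 0])       # read prediction off the CLS token

model = GenreClassifier()
feats = [torch.randn(2, 40, 768), torch.randn(2, 30, 512), torch.randn(2, 60, 128)]
print(model(feats).shape)  # torch.Size([2, 21])
```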
- Emotion4MIDI: A Lyrics-Based Emotion-Labeled Symbolic Music Dataset
  Serkan Sulun, Pedro Oliveira, and Paula Viana
  In Progress in Artificial Intelligence, 2023
We present a new large-scale emotion-labeled symbolic music dataset consisting of 12k MIDI songs. To create this dataset, we first trained emotion classification models on the GoEmotions dataset, achieving state-of-the-art results with a model half the size of the baseline. We then applied these models to lyrics from two large-scale MIDI datasets. Our dataset covers a wide range of fine-grained emotions, providing a valuable resource to explore the connection between music and emotions and, especially, to develop models that can generate music based on specific emotions. Our inference code, trained models, and datasets are available online. A sketch of the lyric-labeling step follows the BibTeX entry below.
@inproceedings{sulunEmotion4MIDILyricsBasedEmotionLabeled2023a, title = {Emotion4MIDI: A Lyrics-Based Emotion-Labeled Symbolic Music Dataset}, booktitle = {Progress in Artificial Intelligence}, author = {Sulun, Serkan and Oliveira, Pedro and Viana, Paula}, editor = {Moniz, Nuno and Vale, Zita and Cascalho, Jos{\'e} and Silva, Catarina and Sebasti{\~a}o, Raquel}, year = {2023}, pages = {77--89}, publisher = {Springer Nature Switzerland}, address = {Cham}, isbn = {978-3-031-49011-8}, doi = {10.1007/978-3-031-49011-8_7}, }
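The labeling step can be sketched as follows: a GoEmotions text classifier scores a song's lyrics and the strongest fine-grained emotion becomes the song's label. The Hugging Face checkpoint below is a publicly available GoEmotions model used here as a stand-in; the paper trains its own, smaller classifier, and the truncation and tie-breaking are assumptions.

```python
# Sketch of the labeling step (not the released code): score lyrics with a
# GoEmotions text classifier and keep the most probable fine-grained emotion.
from transformers import pipeline

# Public GoEmotions checkpoint used as a stand-in for the paper's own model.
classifier = pipeline("text-classification",
                      model="SamLowe/roberta-base-go_emotions")

def label_song(lyrics: str) -> str:
    """Return the most probable GoEmotions label for one song's lyrics."""
    scores = classifier([lyrics[:512]], top_k=None)[0]   # all label scores for this text
    best = max(scores, key=lambda s: s["score"])
    return best["label"]

print(label_song("I walk this empty street alone, and all my hope is gone"))
```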
- Symbolic Music Generation Conditioned on Continuous-Valued Emotions
  Serkan Sulun, Matthew E. P. Davies, and Paula Viana
  IEEE Access, 2022
In this paper, we present a new approach for the generation of multi-instrument symbolic music driven by musical emotion. The principal novelty of our approach centres on conditioning a state-of-the-art transformer on continuous-valued valence and arousal labels. In addition, we provide a new large-scale dataset of symbolic music paired with emotion labels in terms of valence and arousal. We evaluate our approach quantitatively in two ways: first by measuring its note prediction accuracy, and second via a regression task in the valence-arousal plane. Our results demonstrate that our proposed approaches outperform conditioning using control tokens, which is representative of the current state of the art. A sketch of the continuous-valued conditioning follows the BibTeX entry below.
@article{music, title = {Symbolic Music Generation Conditioned on Continuous-Valued Emotions}, author = {Sulun, Serkan and Davies, Matthew E. P. and Viana, Paula}, year = {2022}, journal = {IEEE Access}, volume = {10}, pages = {44617--44626}, doi = {10.1109/ACCESS.2022.3169744}, }
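One simple way to realize continuous-valued conditioning is sketched below under assumed sizes: project the 2-D valence-arousal vector to an embedding and prepend it to the MIDI-event embeddings of an autoregressive transformer. The paper's actual model is a much larger transformer; this only illustrates the conditioning idea.

```python
# Illustrative sketch (not the paper's code): condition an autoregressive token
# model on continuous valence-arousal via a prepended conditioning embedding.
import torch
import torch.nn as nn

class EmotionConditionedLM(nn.Module):
    def __init__(self, vocab_size=400, d_model=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.emotion_proj = nn.Linear(2, d_model)          # (valence, arousal) -> embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, valence_arousal):
        # tokens: (batch, seq); valence_arousal: (batch, 2), continuous values
        cond = self.emotion_proj(valence_arousal).unsqueeze(1)     # (batch, 1, d_model)
        x = torch.cat([cond, self.token_emb(tokens)], dim=1)
        # causal mask so each position only sees the condition and earlier tokens
        mask = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf"),
                                     device=x.device), diagonal=1)
        h = self.decoder(x, mask=mask)
        return self.lm_head(h[:, :-1])      # logits aligned with `tokens` as next-token targets

model = EmotionConditionedLM()
logits = model(torch.randint(0, 400, (2, 32)), torch.tensor([[0.8, -0.3], [-0.5, 0.9]]))
print(logits.shape)  # torch.Size([2, 32, 400])
```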
- Can Learned Frame Prediction Compete with Block Motion Compensation for Video Coding?
  Serkan Sulun and A. Murat Tekalp
  Signal, Image and Video Processing, 2021
Given recent advances in learned video prediction, we investigate whether a simple video codec, using a pretrained deep model for next-frame prediction based on previously encoded/decoded frames and sending no motion side information, can compete with standard video codecs based on block-motion compensation. Frame differences given learned frame predictions are encoded by a standard still-image (intra) codec. Experimental results show that the rate-distortion performance of the simple codec with symmetric complexity is on average better than that of the x264 codec on 10 MPEG test videos, but does not yet reach the level of the x265 codec. This result demonstrates the power of learned frame prediction (LFP), since unlike motion compensation, LFP does not use information from the current picture. The implications of training with L1, L2, or combined L2 and adversarial loss on prediction performance and compression efficiency are analyzed. The overall codec structure is sketched after the BibTeX entry below.
@article{frame_prediction, ids = {sulun2020can}, title = {Can Learned Frame Prediction Compete with Block Motion Compensation for Video Coding?}, author = {Sulun, Serkan and Tekalp, A. Murat}, year = {2021}, journal = {Signal, Image and Video Processing}, volume = {15}, number = {2}, pages = {401--410}, publisher = {Springer}, doi = {10.1007/s11760-020-01751-y}, }
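The codec structure described above can be summarized in a few lines: a frozen learned predictor guesses the next frame from already-decoded frames, only the residual is sent through a still-image (intra) codec, and the decoder adds the decoded residual back onto the same prediction. The predictor and the intra codec below are trivial stand-ins used purely to show the loop, not the paper's implementation.

```python
# Sketch of the codec loop: predict, encode only the residual, reconstruct.
import numpy as np

def predict_next(decoded_frames):
    # Stand-in for the pretrained frame-prediction network: repeat the last decoded frame.
    return decoded_frames[-1]

def intra_codec(residual, step=8.0):
    # Stand-in for a still-image codec: uniform quantization of the residual.
    return np.round(residual / step) * step

def encode_decode(frames):
    decoded = [frames[0]]                          # first frame sent as-is (intra)
    for frame in frames[1:]:
        pred = predict_next(decoded)               # both encoder and decoder can compute this
        residual_hat = intra_codec(frame - pred)   # only the residual is transmitted
        decoded.append(pred + residual_hat)        # decoder-side reconstruction
    return decoded

video = [np.random.rand(64, 64) * 255 for _ in range(5)]
recon = encode_decode(video)
print(np.mean((video[-1] - recon[-1]) ** 2))       # reconstruction MSE of the last frame
```

Because the prediction is computed from frames both sides already have, no motion side information needs to be sent, which is the point the paper makes against block-motion compensation.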
- On Filter Generalization for Music Bandwidth Extension Using Deep Neural Networks
  Serkan Sulun and Matthew E. P. Davies
  IEEE Journal of Selected Topics in Signal Processing, 2020
In this paper, we address a sub-topic of the broad domain of audio enhancement, namely musical audio bandwidth extension. We formulate the bandwidth extension problem using deep neural networks, where a band-limited signal is provided as input to the network, with the goal of reconstructing a full-bandwidth output. Our main contribution centers on the impact of the choice of low-pass filter when training and subsequently testing the network. For two different state-of-the-art deep architectures, ResNet and U-Net, we demonstrate that when the training and testing filters are matched, improvements in signal-to-noise ratio (SNR) of up to 7 dB can be obtained. However, when these filters differ, the improvement falls considerably and under some training conditions results in a lower SNR than the band-limited input. To circumvent this apparent overfitting to filter shape, we propose a data augmentation strategy which utilizes multiple low-pass filters during training and leads to improved generalization to unseen filtering conditions at test time. The augmentation idea is sketched after the BibTeX entry below.
@article{sulunFilterGeneralizationMusic2020, ids = {DBLP:journals/jstsp/SulunD21}, title = {On Filter Generalization for Music Bandwidth Extension Using Deep Neural Networks}, author = {Sulun, Serkan and Davies, Matthew E. P.}, year = {2020}, journal = {IEEE Journal of Selected Topics in Signal Processing}, volume = {15}, number = {1}, pages = {132--142}, publisher = {IEEE}, doi = {10.1109/JSTSP.2020.3037485}, }
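The augmentation strategy can be sketched as follows: each training example is band-limited with a low-pass filter whose family, order, and cutoff are drawn at random, so the network cannot latch onto a single filter shape. The specific filter families and parameter ranges here are assumptions, not the paper's exact configuration.

```python
# Sketch of filter-randomized training data: band-limit each target with a
# randomly drawn low-pass filter; the (input, target) pair is then
# (band-limited audio, full-bandwidth audio).
import numpy as np
from scipy import signal

def random_lowpass(audio, sr=16000):
    cutoff = np.random.uniform(2000, 6000)       # random cutoff frequency (Hz)
    order = np.random.choice([2, 4, 6])          # random filter order
    if np.random.rand() < 0.5:                   # random filter family
        b, a = signal.butter(order, cutoff, btype="low", fs=sr)
    else:
        b, a = signal.cheby1(order, 3, cutoff, btype="low", fs=sr)
    return signal.filtfilt(b, a, audio)

target = np.random.randn(16000)                  # stand-in for a full-bandwidth excerpt
band_limited = random_lowpass(target)
print(band_limited.shape)
```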
- Deep Learned Frame Prediction for Video Compression
  Serkan Sulun
  Master’s thesis, Koc University, 2018
Motion compensation is one of the most essential methods for any video compression algorithm. Video frame prediction is a task analogous to motion compensation. In recent years, the task of frame prediction has been undertaken by deep neural networks (DNNs). In this thesis, we create a DNN to perform learned frame prediction and additionally implement a codec that contains our DNN. We train our network using two methods for two different goals. First, we train our network based on mean square error (MSE) only, aiming to obtain the highest PSNR values at frame prediction and video compression. Second, we use adversarial training to produce visually more realistic frame predictions. For frame prediction, we compare our method with the baseline methods of frame difference and 16x16 block motion compensation. For video compression, we further include the x264 video codec in the comparison. We show that in frame prediction, adversarial training produces frames that look sharper and more realistic compared to MSE-based training, but in video compression it consistently performs worse. This shows that even though adversarial training is useful for generating video frames that are more pleasing to the human eye, it should not be employed for video compression. Moreover, our network trained with MSE produces accurate frame predictions, and in quantitative results, for both tasks, it produces comparable results in all videos and outperforms other methods on average. More specifically, learned frame prediction outperforms other methods in terms of rate-distortion performance in the case of high-motion video, while the rate-distortion performance of our method is competitive with that of x264 for low-motion video. The combined MSE and adversarial objective is sketched after the BibTeX entry below.
@phdthesis{frame_prediction_thesis, title = {Deep Learned Frame Prediction for Video Compression}, author = {Sulun, Serkan}, year = {2018}, eprint = {1811.10946}, primaryclass = {cs, eess}, address = {Istanbul, Turkey}, urldate = {2022-07-20}, archiveprefix = {arXiv}, school = {Koc University}, doi = {10.48550/arXiv.1811.10946}, note = {Master's Thesis}, }
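The adversarial setting can be sketched as an MSE fidelity term plus an adversarial term that rewards fooling a discriminator. The relative weighting, the non-saturating form of the adversarial term, and the tiny discriminator below are illustrative assumptions, not the thesis' exact configuration.

```python
# Sketch of a combined MSE + adversarial objective for the frame predictor.
import torch
import torch.nn as nn
import torch.nn.functional as F

def generator_loss(pred_frame, true_frame, discriminator, adv_weight=0.01):
    mse = F.mse_loss(pred_frame, true_frame)                 # fidelity term
    fake_score = discriminator(pred_frame)                   # discriminator logit on the prediction
    adv = F.binary_cross_entropy_with_logits(
        fake_score, torch.ones_like(fake_score))             # encourage "looks real"
    return mse + adv_weight * adv

# Tiny stand-in discriminator, purely for illustration
disc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 1))
pred = torch.rand(2, 1, 64, 64)
true = torch.rand(2, 1, 64, 64)
print(generator_loss(pred, true, disc).item())
```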