AI Cover Musician

Published:

Takes a song, interprets the lyrics, produces a cover of that song.

Here is a sample of what it produces:

AI Cover Musician was submitted to the AWS Marketplace Developer Challenge: ML Powered Solutions hackathon. Below is a demo video demonstrating how covers are created.

Follow AI Cover Musician on:

SoundCloud: https://soundcloud.com/ai-music-covers

600+ plays on SoundCloud!!!

YouTube: https://www.youtube.com/channel/UC1L1IyK0OJHqMtY6ttXtvCw?

Repo Link: https://github.com/basilwong/ai-cover-musician

Submission Link: https://devpost.com/software/ai-cover-musician

Inspiration

Everyone is inside and depressed right now. What could make them chuckle? Music is a phenomenon that connects everyone, so why not produce an interesting take on it with AWS AI products?

There’s a high barrier to entry for kids getting into technology. Maybe this project could eventually become something that helps kids who grow up interested in music get introduced to the ML technology of AWS.

What it does

Short Answer: Takes a song, interprets it, produces a cover of that song.

‘Longer’ Answer: Takes a song and uses the ‘Quantiphi Source Separation’ package from the AWS Marketplace to isolate the vocals. The isolated vocals are then interpreted by Amazon Transcribe. The generated transcription is fed into Amazon Polly to synthesize speech from that interpretation. The resulting MP3 is then mixed with the accompaniment (everything left after the vocals are removed), producing the final product: a cover of the original song.
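
As an illustration of that final mixing step, here is a minimal sketch using pydub (the file names are placeholders, and pydub needs ffmpeg available):

```python
# Minimal sketch of the final mix: overlay the generated vocals on the
# separated accompaniment. File names are hypothetical placeholders.
from pydub import AudioSegment

vocals = AudioSegment.from_mp3("generated_vocals.mp3")
accompaniment = AudioSegment.from_mp3("accompaniment.mp3")

# overlay() mixes the vocals on top of the accompaniment base layer
cover = accompaniment.overlay(vocals)
cover.export("cover.mp3", format="mp3")
```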

How I built it

This project was built entirely in Python 3.

Amazon SageMaker:

  • Used for accessing all AWS APIs

Quantiphi Source Separation:

  • Used to take the MP3 of a song and isolate the vocals
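
As a rough illustration, invoking a deployed Marketplace model endpoint with boto3 might look like the following (the endpoint name and content type are assumptions, not the package's documented interface):

```python
# Hypothetical sketch: send one MP3 segment to a deployed SageMaker
# endpoint hosting the source-separation model package.
import boto3

runtime = boto3.client("sagemaker-runtime")

with open("song_segment.mp3", "rb") as f:
    response = runtime.invoke_endpoint(
        EndpointName="source-separation-endpoint",  # hypothetical name
        ContentType="audio/mpeg",                   # assumed content type
        Body=f.read(),
    )

# The response body carries the model output (e.g. the isolated vocals)
separated_audio = response["Body"].read()
```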

Amazon Transcribe:

  • Used to transcribe the isolated vocals of the song
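
A rough sketch of starting and polling a transcription job with boto3 (the bucket, key, and job name are hypothetical; the audio has to be uploaded to S3 first):

```python
# Sketch: transcribe the isolated vocals and wait for the job to finish.
import time
import boto3

transcribe = boto3.client("transcribe")
job_name = "isolated-vocals-job"  # hypothetical job name

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={"MediaFileUri": "s3://my-bucket/isolated_vocals.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
)

# Poll until the job completes; the transcript JSON (linked via
# TranscriptFileUri) includes per-word start_time/end_time fields.
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)
```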

Amazon Polly:

  • Used to generate the vocals for AI Cover Musician.
  • This is done word by word to allow flexibility with timing: an MP3 is generated for each word.

  • The word MP3s are then concatenated together and mixed with the accompaniment. As they are concatenated, each clip is checked and adjusted for pitch and timing (a sketch follows this list).

  • Pitch detection is done using CREPE, a deep learning Python library: https://pypi.org/project/crepe/
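
Below is a simplified sketch of the word-by-word approach, assuming pydub for stitching and Polly's "Joanna" voice (both illustrative choices, not necessarily what the project uses):

```python
# Sketch: synthesize each transcript word with Polly, then concatenate
# the clips into one vocal track. The word list is a placeholder.
import io
import boto3
from pydub import AudioSegment

polly = boto3.client("polly")
words = ["these", "are", "the", "lyrics"]  # from the Transcribe output

cover_vocals = AudioSegment.empty()
for word in words:
    resp = polly.synthesize_speech(
        Text=word, OutputFormat="mp3", VoiceId="Joanna"
    )
    clip = AudioSegment.from_mp3(io.BytesIO(resp["AudioStream"].read()))
    # In the full pipeline each clip would be checked and adjusted for
    # pitch and timing here before being appended.
    cover_vocals += clip

cover_vocals.export("generated_vocals.mp3", format="mp3")
```

For the pitch check, CREPE's crepe.predict(audio, sr) returns per-frame time, frequency, and confidence estimates that can be compared against the original vocal clip.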

Challenges I ran into

While Getting it to Work:

  • Audio files were too large: The first challenge was that the API for the ‘Quantiphi Source Separation’ package would throw a timeout/error for MP3 files longer than 30 seconds. To work around this, the song is split into 30-second segments that are fed into the model package’s API as a batch (a sketch of the segmentation follows this list).

  • Dependencies: ffmpeg is a library that is pretty much required if you’re going to work with audio files in software. I couldn’t figure out how to install ffmpeg in the SageMaker environment, as ‘yum’ doesn’t have all the correct dependencies to install it. I eventually figured out that SageMaker is built on the Amazon Linux distro, which allowed me to piece together how to satisfy this dependency of the project.

  • Interpreting documentation: Figuring out how to get Amazon Transcribe and Amazon Polly to work through their Python SDKs took effort. I wish Amazon Polly in particular had more thorough documentation on how to query the API.
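
The 30-second segmentation mentioned above is straightforward with pydub, which slices audio by milliseconds; a minimal sketch (the file name is a placeholder):

```python
# Sketch: split a song into 30-second chunks before sending each one
# to the source-separation API.
from pydub import AudioSegment

song = AudioSegment.from_mp3("song.mp3")
segment_ms = 30 * 1000  # pydub slices are indexed in milliseconds

segments = [song[i:i + segment_ms] for i in range(0, len(song), segment_ms)]
for n, segment in enumerate(segments):
    segment.export(f"segment_{n}.mp3", format="mp3")
```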

While Getting it to Sound Less Like Garbage:

  • Amazon Transcribe’s start and end times: When transcribing music, Amazon Transcribe tends to produce very conservative start and end times, meaning the word timings it generates tend to be too short. Because Amazon Polly’s output is timed and sized to those start and end times, the pronunciation of the lyrics becomes very choppy and at some points unintelligible.

  • Fixing the timing of the song: The vocals tend to run ahead of the backing track, since Amazon Polly only lets you specify the maximum length a body of text takes to say. To compensate, the expected start time of each word is tracked using the Amazon Transcribe timestamps, and pauses are inserted where needed (sketched after this list).

  • Pitch shifting the vocals: For some reason this never went according to plan :/ (the naive approach and its pitfall are sketched after this list)

  • A further limitation is that Amazon Transcribe is not designed to recognize words in song. Words in riffs aren’t recognized, words in quickly delivered rap bars aren’t recognized, and complicated lyrics aren’t transcribed properly.
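
Here is a sketch of the timing compensation, assuming each Polly clip is paired with its Transcribe start time (the data layout here is illustrative, not the project’s exact structure):

```python
# Sketch: pad with silence whenever the synthesized vocals run ahead
# of where the transcript says the next word should start.
from pydub import AudioSegment

def assemble_vocals(timed_words):
    """timed_words: list of (start_time_sec, AudioSegment) pairs."""
    track = AudioSegment.empty()
    for start_sec, clip in timed_words:
        expected_ms = int(start_sec * 1000)
        if len(track) < expected_ms:
            # Vocals are ahead of the accompaniment: insert a pause
            track += AudioSegment.silent(duration=expected_ms - len(track))
        track += clip
    return track
```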
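For context on the pitch-shifting trouble: the common naive trick in pydub is to resample, which shifts pitch but also changes duration (hence the TODO item below). This sketch shows the generic trick, not necessarily the project’s code:

```python
# Sketch: naive pitch shift by resampling. Raising the frame rate
# moves the pitch up but also shortens the clip.
from pydub import AudioSegment

def naive_pitch_shift(clip, semitones):
    new_rate = int(clip.frame_rate * (2 ** (semitones / 12.0)))
    shifted = clip._spawn(clip.raw_data, overrides={"frame_rate": new_rate})
    # Reset to a standard rate so players interpret the audio correctly
    return shifted.set_frame_rate(44100)
```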

Accomplishments that I’m proud of

Coming up with an end solution that didn’t immediately give me a headache was a really big accomplishment.

What I learned

I started out with no experience working with AWS SageMaker or any of the other products (AWS Marketplace, Amazon Polly, Amazon Transcribe, or any of the Amazon APIs). After working on this project, I feel much more confident building things with those products.

I also had no experience at all working with music (or sound, for that matter) in code. Learning about the Python libraries that let you manipulate sound was very interesting.

What’s next for AI Cover Musician

TODO List:

  • Supporting songs in multiple languages
  • Interpreting the accompaniment (into MIDI) and generating it again
  • Pitch shifting without changing duration
  • Improving pitch correction

AI Cover Musician’s goal is to become world famous because of music.