
neural-style-audio-tf's Introduction

Audio Style Transfer

This is a TensorFlow reimplementation of Vadim's Lasagne code for the audio style transfer algorithm, which uses convolutions with random weights to represent audio features.

To listen to examples, go to the blog post. Also check out the Torch implementation.

So far it is CPU-only, but if you are proficient in TensorFlow it should be easy to switch to GPU. In practice, it already runs fast on CPU.

Dependencies

  • librosa (pip install librosa)
  • numpy and matplotlib

The easiest way to install Python is to use Anaconda.

How to run

  • Open neural-style-audio-tf.ipynb in Jupyter.
  • In case you want to use your own audio files as inputs, first cut them to 10 seconds with:
ffmpeg -i yourfile.mp3 -ss 00:00:00 -t 10 yourfile_10s.mp3
  • Set CONTENT_FILENAME and STYLE_FILENAME in the third cell of the Jupyter notebook to your input files (see the example below).
  • Run all cells.
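
For example, the third cell might then read as follows (the file names here are placeholders, not files shipped with the repo):

CONTENT_FILENAME = 'inputs/my_content_10s.mp3'
STYLE_FILENAME = 'inputs/my_style_10s.mp3'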

The most frequent problem is domination of either content or style in the output. To fight this, adjust the ALPHA parameter. Larger ALPHA means more content in the output; ALPHA=0 means no content at all, which reduces stylization to texture generation. The example output outputs/imperial_usa.wav, the result of mixing the content of the Imperial March from Star Wars with the style of the U.S. national anthem, was obtained with the default value ALPHA=1e-2.
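
Concretely, ALPHA scales only the content term of the objective. A minimal sketch, following the loss construction visible in the notebook code quoted in the issues below:

content_loss = ALPHA * 2 * tf.nn.l2_loss(net - content_features)
style_loss = 2 * tf.nn.l2_loss(gram - style_gram)
loss = content_loss + style_loss  # ALPHA=0 leaves only the style term: pure texture generation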

neural-style-audio-tf's People

Contributors

dmitryulyanov

neural-style-audio-tf's Issues

Why the hell is this not talked about more?

You added this 3 years ago, and I am just now finding it. I have been searching for an implementation of neural style that treats music the way the original treats images, in this case working on waveforms. This is amazing; have you built more upon this? Thanks for this repo.

add the ability to load a pretrained net?

Though I fully trust Dmitry and believe his claim that a random CNN is as good as a pretrained net at detecting and extracting texture features (the "style"), I would really appreciate the possibility of testing a pretrained net for extracting the "content" features.
While experimenting with this lovely software, I found that its ability to discriminate the content structure in "content" sound files does not appear as accurate as in the examples provided elsewhere for the image style transfer case. In particular, it seems that too much of the style remains in the content, and this is perhaps the cause of the high dominance of some audio files when combined with others.
I noted that the best combinations (i.e., where the "content" audio imposes only its structure and the "style" audio enforces its own texture) are produced when the spectra of the two audios share most of their frequencies, but the "style" has less structure or, in other words, less evident "beats". This would correspond, in images, to the "style" image having mostly the same spectrum as the "content" one, but featuring weaker and shorter edges. The output audio, in this case, resembles an "envelope" taken from the "content" audio modulating the amplitude of the "style" audio.
On the other hand, when the "style" audio lies in a mostly different region of the frequency spectrum (e.g., higher frequencies) than the "content", the two audios get mixed (their spectra appear to be merged) and both are almost equally present in the output, producing very confusing results in most cases.
I can provide some examples, but I guess anyone can figure out what I'm trying to explain, by testing on the available audio samples.
Looking at the results produced by applying style transfer to images, I would expect a different behavior, where the style (i.e., the texture) of the "style" image almost completely substitutes the texture of the "content" image. I suspect that some more investigation is needed into the selection of the most suitable net for content-feature extraction, and therefore I would love some hints about how to load and use a pretrained network.
Sorry for the long message.

Why no 1D convolution?

In your blog post, you wrote that 1D convolutions work better than 2D ones, but this TensorFlow version uses Conv2D rather than Conv1D. Why is that? Is there a reason, or am I missing something?
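
(For reference: the notebook's input has shape [1, 1, N_SAMPLES, N_CHANNELS], as the code quoted in a later issue shows, so the 2D convolution slides only along the time axis and is effectively 1D. A sketch of the equivalence; the sizes and kernel width below are examples, not the repo's values:)

import numpy as np
import tensorflow as tf  # assumes TF 1.x, as in this repo

N_SAMPLES, N_CHANNELS, N_FILTERS, WIDTH = 430, 1025, 64, 11  # example sizes only
x = tf.placeholder(tf.float32, [1, 1, N_SAMPLES, N_CHANNELS])
kernel = tf.constant(np.random.randn(1, WIDTH, N_CHANNELS, N_FILTERS).astype(np.float32))
# A height-1 input means conv2d can only move along time:
out2d = tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1], padding='VALID')
# The same computation expressed as a true 1D convolution over [batch, time, channels]:
out1d = tf.nn.conv1d(tf.squeeze(x, axis=1), tf.squeeze(kernel, axis=0), stride=1, padding='VALID')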

Getting Syntax error and deprecated TF function

Hi Dmitry,
thanks for putting this together; this is exactly what I was looking for for an experiment!
I am definitely a beginner at this, but when I tried to run your example I got a SyntaxError in the optimize cell and in the output cell's print, since you now have to add parentheses.

File "<ipython-input-16-9eb962c6044b>", line 50
    print 'Final loss:', loss.eval()
                      ^
SyntaxError: invalid syntax

I also figured out that tf.initialize_all_variables() is now deprecated, so I changed it to tf.global_variables_initializer().
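
For anyone hitting the same two problems, the fixes described above amount to the following (a sketch; loss comes from the notebook, and a default session is assumed to be active, as the original loss.eval() call implies):

print('Final loss:', loss.eval())         # Python 3: print needs parentheses
tf.global_variables_initializer().run()   # replaces the deprecated tf.initialize_all_variables()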

Then it all works well!
Thanks!

learning style from more than one example?

I'm trying to understand whether it would make sense to learn style from a group of examples (in this case, audio files) instead of just one. In the best case, this would produce a sort of "mean style" representing the group of audio excerpts. In your experience, would such an approach work (as long as the examples share some style in common), or would it just produce garbage?
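
One plausible way to try this, sketched here as an assumption rather than anything the repo implements, is to average the Gram matrices of several style excerpts and use the average as the style target:

import numpy as np

def mean_style_gram(style_feats_list):
    # style_feats_list: one [time, channels] feature array per style file,
    # each extracted the same way the notebook computes its style features
    grams = [f.T.dot(f) / f.shape[0] for f in style_feats_list]  # notebook's normalization
    return np.mean(grams, axis=0)  # drop-in replacement for style_gram in the style loss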

AttributeError: module 'librosa' has no attribute 'output'

Hi Dmitry,
I wanted to try an audio style transfer, but I get this error at the optimize and invert spectrum step.

Started optimization.
INFO:tensorflow:Optimization terminated with:
Message: b'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'
Objective function value: 1785.756958
Number of iterations: 300
Number of functions evaluations: 309
Final loss: 1785.7569580078125

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', 'from sys import stderr\n\n#@markdown ---\n#@markdown Advanced settings / Расширенные настройки\nALPHA= 0.1 #@param {type:"slider", min:0.01, max:0.2, step:0.01}\nlearning_rate= 0.01 #@param {type:"slider", min:0.001, max:0.02, step:0.001}\niterations = 300 #@param {type:"slider", min:100, max:500, step:10}\n#@markdown ---\nresult = None\nwith tf.Graph().as_default():\n\n # Build graph with variable input\n #x = tf.Variable(np.zeros([1,1,N_SAMPLES,N_CHANNELS], dtype=np.float32), name="x")\n x = tf.Variable(np.random.randn(1,1,N_SAMPLES,N_CHANNELS).astype(np.float32)*1e-3, name="x")\n\n kernel_tf = tf.constant(kernel, name="kernel", dtype='float32')\n conv = tf.nn.conv2d(\n x,\n kernel_tf,\n strides=[1, 1, 1, 1],\n padding="VALID",\n name="conv")\n \n \n net = tf.nn.relu(conv)\n\n content_loss = ALPHA * 2 * tf.nn.l2_loss(\n net - content_features)\n\n style_loss = 0\n\n _, height, width, number = map(lambda i: i.value, net.get_shape())\n\n size = height * width * number\n feats = tf.reshape(net, (-1, number))\n gram = tf.matmul(tf.transpose(feats), feats) / N_SAMPLES\n style_loss = 2 * tf.nn.l2_loss(gram - style_gram)\n\n # Overall loss\n loss = content_loss + style_loss\n\n opt = tf.contrib.opt.S...

2 frames
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
2115 magic_arg_s = self.var_expand(line, stack_depth)
2116 with self.builtin_trap:
-> 2117 result = fn(magic_arg_s, cell)
2118 return result
2119

<decorator-gen> in time(self, line, cell, local_ns)

/usr/local/lib/python3.7/dist-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
186 # but it's overkill for just that one bit of state.
187 def magic_deco(arg):
--> 188 call = lambda f, *a, **k: f(*a, **k)
189
190 if callable(arg):

/usr/local/lib/python3.7/dist-packages/IPython/core/magics/execution.py in time(self, line, cell, local_ns)
1191 else:
1192 st = clock2()
-> 1193 exec(code, glob, local_ns)
1194 end = clock2()
1195 out = None

<timed exec> in <module>()

AttributeError: module 'librosa' has no attribute 'output'
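
librosa removed the librosa.output module in version 0.8, which is what this AttributeError means. A common workaround, not part of this repo, is to write the result with the soundfile package; the variable names below stand in for whatever the notebook passed to librosa.output.write_wav:

import soundfile as sf
# instead of librosa.output.write_wav(OUTPUT_FILENAME, result, sr):
sf.write(OUTPUT_FILENAME, result, sr)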

what does it take to produce longer outputs?

hello Dmitry,
a quick question: how do I produce longer output files with this approach? Do I necessarily have to provide longer inputs, or is there another way?
Thank you very much for sharing your results
Giancarlo
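
Judging from the README's preprocessing step, the output length follows the input length, so one option, assuming memory allows, since optimization cost grows with the number of samples, is simply to cut longer excerpts:

ffmpeg -i yourfile.mp3 -ss 00:00:00 -t 30 yourfile_30s.mp3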

Blank screen error

Hi Dmitry :)
I have encountered some kind of error when trying to transfer style from one song to another.
After running a few cells, the screen goes black and I cannot use the keyboard or mouse, and I can't enter tty mode; it looks like a regular system crash.
I'm using Ubuntu 16.04 with TensorFlow GPU (GeForce 760gti, 2 GB VRAM).
Is this problem caused by using the GPU version?
