
neural-style-audio-tf's Introduction

Audio Style Transfer

This is a TensorFlow reimplementation of Vadim's Lasagne code for the audio style transfer algorithm, which uses convolutions with random weights to represent audio features.

To listen to examples, go to the blog post. Also check out the Torch implementation.

So far it is CPU-only, but if you are proficient in TensorFlow it should be easy to switch to GPU. In practice, it already runs fast on CPU.

Dependencies

  • librosa (pip install librosa)
  • numpy and matplotlib

The easiest way to install Python is to use Anaconda.

How to run

  • Open neural-style-audio-tf.ipynb in Jupyter.
  • In case you want to use your own audio files as inputs, first cut them to 10 seconds with:
ffmpeg -i yourfile.mp3 -ss 00:00:00 -t 10 yourfile_10s.mp3
  • Set CONTENT_FILENAME and STYLE_FILENAME in the third cell of the Jupyter notebook to your input files (see the example below).
  • Run all cells.
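
For example, the third cell might then read as follows (the file names here are placeholders, not files shipped with the repo):

CONTENT_FILENAME = 'inputs/my_content_10s.mp3'
STYLE_FILENAME = 'inputs/my_style_10s.mp3'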

The most frequent problem is domination of either content or style in the output. To fight this, adjust the ALPHA parameter. Larger ALPHA means more content in the output; ALPHA=0 means no content at all, which reduces stylization to texture generation. The example output outputs/imperial_usa.wav, the result of mixing the content of the Imperial March from Star Wars with the style of the U.S. national anthem, was obtained with the default value ALPHA=1e-2.
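
Concretely, ALPHA scales only the content term of the objective. A minimal sketch, following the loss construction visible in the notebook code quoted in the issues below:

content_loss = ALPHA * 2 * tf.nn.l2_loss(net - content_features)
style_loss = 2 * tf.nn.l2_loss(gram - style_gram)
loss = content_loss + style_loss  # ALPHA=0 leaves only the style term: pure texture generation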

neural-style-audio-tf's People

Contributors

dmitryulyanov

neural-style-audio-tf's Issues

Why the hell is this not talked about more?

You added this 3 years ago, and I am just now finding it. I have been searching for an implementation of neural style that treats music the way the original treats images, in this case working on waveforms. This is amazing; have you built more upon this? Thanks for this repo.

add the ability to load a pretrained net?

Though I fully trust Dmitry and believe his claim that a random CNN is as good as a pretrained net at detecting and extracting texture features (the "style"), I would really appreciate the possibility of testing a pretrained net for extracting the "content" features.
While experimenting with this lovely software, I found that its ability to discriminate the content structure in "content" sound files does not appear as accurate as in the examples provided elsewhere for the image style transfer case. In particular, it seems that too much of the style remains in the content, and this is perhaps the cause of the high dominance of some audio files when combined with others.
I noted that the best combinations (i.e., where the "content" audio imposes only its structure and the "style" audio enforces its own texture) are produced when the spectra of the two audios share most of their frequencies, but the "style" has less structure or, in other words, less evident "beats". This would correspond, in images, to the "style" image having mostly the same spectrum as the "content" one, but featuring weaker and shorter edges. The output audio, in this case, resembles an "envelope" taken from the "content" audio modulating the amplitude of the "style" audio.
On the other hand, when the "style" audio lies in a mostly different region of the frequency spectrum (e.g., higher frequencies) than the "content", the two audios get mixed (their spectra appear to be merged) and both are almost equally present in the output, producing very confusing results in most cases.
I can provide some examples, but I guess anyone can figure out what I'm trying to explain, by testing on the available audio samples.
Looking at the results produced by applying style transfer to images, I would expect a different behavior, where the style (i.e., the texture) of the "style" image almost completely substitutes the texture of the "content" image. I suspect that some more investigation is needed into the selection of the most suitable net for content-feature extraction, and therefore I would love some hints about how to load and use a pretrained network.
Sorry for the long message.

Why no 1D convolution?

In your blog post, you wrote that 1D convolutions work better than 2D ones, but this TensorFlow version uses Conv2D rather than Conv1D. Why is that? Is there a reason, or am I missing something?
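
(For reference: the notebook's input has shape [1, 1, N_SAMPLES, N_CHANNELS], as the code quoted in a later issue shows, so the 2D convolution slides only along the time axis and is effectively 1D. A sketch of the equivalence; the sizes and kernel width below are examples, not the repo's values:)

import numpy as np
import tensorflow as tf  # assumes TF 1.x, as in this repo

N_SAMPLES, N_CHANNELS, N_FILTERS, WIDTH = 430, 1025, 64, 11  # example sizes only
x = tf.placeholder(tf.float32, [1, 1, N_SAMPLES, N_CHANNELS])
kernel = tf.constant(np.random.randn(1, WIDTH, N_CHANNELS, N_FILTERS).astype(np.float32))
# A height-1 input means conv2d can only move along time:
out2d = tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1], padding='VALID')
# The same computation expressed as a true 1D convolution over [batch, time, channels]:
out1d = tf.nn.conv1d(tf.squeeze(x, axis=1), tf.squeeze(kernel, axis=0), stride=1, padding='VALID')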

Getting Syntax error and deprecated TF function

Hi Dmitry,
thanks for putting this together; this is exactly what I was looking for for an experiment!
I am definitely a beginner at this, but when I tried to run your example I got a SyntaxError in the optimize cell and in the output cell's print, since you now have to add parentheses.

File "<ipython-input-16-9eb962c6044b>", line 50
    print 'Final loss:', loss.eval()
                      ^
SyntaxError: invalid syntax

I also figured out that tf.initialize_all_variables() is now deprecated, so I changed it to tf.global_variables_initializer().
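
For anyone hitting the same two problems, the fixes described above amount to the following (a sketch; loss comes from the notebook, and a default session is assumed to be active, as the original loss.eval() call implies):

print('Final loss:', loss.eval())         # Python 3: print needs parentheses
tf.global_variables_initializer().run()   # replaces the deprecated tf.initialize_all_variables()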

Then it all works well!
Thanks!

learning style from more than one example?

I'm trying to understand whether it would make sense to learn style from a group of examples (in this case, audio files) instead of just one. In the best case, this would produce a sort of "mean style" representing the group of audio excerpts. In your experience, would such an approach work (as long as the examples share some style in common), or would it just produce garbage?
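
One plausible way to try this, sketched here as an assumption rather than anything the repo implements, is to average the Gram matrices of several style excerpts and use the average as the style target:

import numpy as np

def mean_style_gram(style_feats_list):
    # style_feats_list: one [time, channels] feature array per style file,
    # each extracted the same way the notebook computes its style features
    grams = [f.T.dot(f) / f.shape[0] for f in style_feats_list]  # notebook's normalization
    return np.mean(grams, axis=0)  # drop-in replacement for style_gram in the style loss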

AttributeError: module 'librosa' has no attribute 'output'

Hi Dmitry,
I wanted to try an audio style transfer, but I get this error at the optimize and invert spectrum step.

Started optimization.
INFO:tensorflow:Optimization terminated with:
Message: b'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'
Objective function value: 1785.756958
Number of iterations: 300
Number of functions evaluations: 309
Final loss: 1785.7569580078125

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', 'from sys import stderr\n\n#@markdown ---\n#@markdown Advanced settings / Расширенные настройки\nALPHA= 0.1 #@param {type:"slider", min:0.01, max:0.2, step:0.01}\nlearning_rate= 0.01 #@param {type:"slider", min:0.001, max:0.02, step:0.001}\niterations = 300 #@param {type:"slider", min:100, max:500, step:10}\n#@markdown ---\nresult = None\nwith tf.Graph().as_default():\n\n # Build graph with variable input\n #x = tf.Variable(np.zeros([1,1,N_SAMPLES,N_CHANNELS], dtype=np.float32), name="x")\n x = tf.Variable(np.random.randn(1,1,N_SAMPLES,N_CHANNELS).astype(np.float32)*1e-3, name="x")\n\n kernel_tf = tf.constant(kernel, name="kernel", dtype='float32')\n conv = tf.nn.conv2d(\n x,\n kernel_tf,\n strides=[1, 1, 1, 1],\n padding="VALID",\n name="conv")\n \n \n net = tf.nn.relu(conv)\n\n content_loss = ALPHA * 2 * tf.nn.l2_loss(\n net - content_features)\n\n style_loss = 0\n\n _, height, width, number = map(lambda i: i.value, net.get_shape())\n\n size = height * width * number\n feats = tf.reshape(net, (-1, number))\n gram = tf.matmul(tf.transpose(feats), feats) / N_SAMPLES\n style_loss = 2 * tf.nn.l2_loss(gram - style_gram)\n\n # Overall loss\n loss = content_loss + style_loss\n\n opt = tf.contrib.opt.S...

2 frames
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
2115 magic_arg_s = self.var_expand(line, stack_depth)
2116 with self.builtin_trap:
-> 2117 result = fn(magic_arg_s, cell)
2118 return result
2119

<decorator-gen> in time(self, line, cell, local_ns)

/usr/local/lib/python3.7/dist-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
186 # but it's overkill for just that one bit of state.
187 def magic_deco(arg):
--> 188 call = lambda f, *a, **k: f(*a, **k)
189
190 if callable(arg):

/usr/local/lib/python3.7/dist-packages/IPython/core/magics/execution.py in time(self, line, cell, local_ns)
1191 else:
1192 st = clock2()
-> 1193 exec(code, glob, local_ns)
1194 end = clock2()
1195 out = None

<timed exec> in <module>()

AttributeError: module 'librosa' has no attribute 'output'
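
librosa removed the librosa.output module in version 0.8, which is what this AttributeError means. A common workaround, not part of this repo, is to write the result with the soundfile package; the variable names below stand in for whatever the notebook passed to librosa.output.write_wav:

import soundfile as sf
# instead of librosa.output.write_wav(OUTPUT_FILENAME, result, sr):
sf.write(OUTPUT_FILENAME, result, sr)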

what does it take to produce longer outputs?

hello Dmitry,
a quick question: how do I produce longer output files with this approach? Do I necessarily have to provide longer inputs, or is there another way?
Thank you very much for sharing your results
Giancarlo
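
Judging from the README's preprocessing step, the output length follows the input length, so one option, assuming memory allows, since optimization cost grows with the number of samples, is simply to cut longer excerpts:

ffmpeg -i yourfile.mp3 -ss 00:00:00 -t 30 yourfile_30s.mp3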

Blank screen error

Hi Dmitry :)
I have encountered some kind of error when trying to transfer style from one song to another.
After running a few cells, the screen goes black and I cannot use the keyboard or mouse, and I can't enter tty mode; it looks like a regular system crash.
I'm using Ubuntu 16.04 with TensorFlow GPU (GeForce 760gti, 2 GB VRAM).
Is this problem caused by using the GPU version?
