What do all recent super-powerful image models like DALL·E, Imagen, or Midjourney have in common? Other than their high computing costs, huge training times, and shared hype, they are all based on the same mechanism: diffusion. Diffusion models recently achieved state-of-the-art results for most image tasks, including text-to-image with DALL·E, but also for many other image-generation tasks like image inpainting, style transfer, or image super-resolution. But how do they work? Learn more in the video...

References
►Read the full article: https://www.louisbouchard.ai/latent-diffusion-models/
►Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B., 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684–10695), https://arxiv.org/pdf/2112.10752.pdf
►Latent Diffusion Code: https://github.com/CompVis/latent-diffusion
►Stable Diffusion Code (text-to-image based on LD): https://github.com/CompVis/stable-diffusion
►Try it yourself: https://huggingface.co/spaces/stabilityai/stable-diffusion
►Web application: https://stabilityai.us.auth0.com/u/login?state=hKFo2SA4MFJLR1M4cVhJcllLVmlsSV9vcXNYYy11Q25rRkVzZaFur3VuaXZlcnNhbC1sb2dpbqN0aWTZIFRjV2p5dHkzNGQzdkFKZUdyUEprRnhGeFl6ZVdVUDRZo2NpZNkgS3ZZWkpLU2htVW9PalhwY2xRbEtZVXh1Y0FWZXNsSE4
►My Newsletter (a new AI application explained weekly in your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

What do all recent super-powerful image models like DALL·E, Imagen, or Midjourney have in common? Other than high computing costs, huge training times, and shared hype, they are all based on the same mechanism: diffusion. Diffusion models recently achieved state-of-the-art results for most image tasks, including text-to-image with DALL·E, but also for many other image-generation tasks like image inpainting, style transfer, or image super-resolution. Though there
are a few downsides: they work sequentially on the whole image, meaning that both training and inference times are super expensive. This is why you need hundreds of GPUs to train such a model and why you wait a few minutes to get your results. It's no surprise that only the biggest companies, like Google or OpenAI, are releasing these models.

But what are they? I've covered diffusion models in a couple of videos, which I invite you to check out for a better understanding. They are iterative models that take random noise as input, which can be conditioned with a text or an image, so it's not completely random. The model iteratively learns to remove this noise by learning what parameters it should apply to the noise to end up with a final image. So a basic diffusion model will take random noise the size of the image and learn to remove it step by step until we get back to a real image. This is possible because the model has access to the real images during training and can learn the right parameters by applying noise to the image iteratively until it reaches complete noise and is unrecognizable. Then, when we are satisfied with the noise we get from all our images, meaning that they are similar and generated from a similar distribution, we are ready to use our model in reverse: feed it similar noise and run the steps in the reverse order, expecting an image similar to the ones used during training. The main problem here is that you are working directly with pixels and large data inputs like images. Let's see how we can fix this computation issue while keeping the quality of the results the same, as shown here compared with DALL·E. But first, give me a few seconds to introduce you to my friends at Qwak, who are sponsoring this video.
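The forward-then-reverse process described above can be sketched with a toy example. This is a minimal NumPy illustration, not the paper's code: the linear noise schedule, array sizes, and step counts are all assumptions for demonstration, and a real model would learn to predict the noise with a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a flat array standing in for pixel values.
x0 = rng.standard_normal(64)

# Linear noise schedule (an assumption for illustration):
# alpha_bars[t] shrinks from ~1 toward ~0 as t grows.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def add_noise(x0, t):
    """Closed-form forward step: x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

# Early in the process the image still dominates; by the last step
# the sample is almost pure, unrecognizable noise.
x_early = add_noise(x0, 10)
x_late = add_noise(x0, T - 1)
print(alpha_bars[10], alpha_bars[T - 1])
```

During training, a network is taught to predict `eps` from the noisy `x_t` and the step `t`; at inference, you start from pure noise and subtract the predicted noise iteratively, which is the "model in reverse" the video describes.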
As you most certainly know, the majority of businesses now report AI and ML adoption in their processes, but complex operations such as model deployment, training, testing, and feature store management seem to stand in the way of progress. ML model deployment is one of the most complex processes. It is such a rigorous process that data science teams spend way too much time solving back-end and engineering tasks before being able to push the model into production, something I personally experienced. It also requires very different skill sets, often requiring two different teams working closely together. Fortunately for us, Qwak delivers a fully managed platform that unifies ML engineering and data operations, providing agile infrastructure that enables the continuous productization of ML models at scale. You don't have to learn how to do everything end-to-end anymore thanks to them. Qwak empowers organizations to deliver machine learning models into production at scale. If you want to speed up your model delivery to production, please take a few minutes and click the first link below to check what they offer, as I'm sure it will be worthwhile. Thanks to anyone taking a look and supporting me and my friends at Qwak.

How can these powerful diffusion models be made computationally efficient? By transforming them into latent diffusion models. This means that Robin Rombach and his colleagues implemented the diffusion approach we just covered within a compressed image representation instead of the image itself, and then worked to reconstruct the image. So they are not working with the pixel space, or regular images, anymore. Working in such a compressed space not only allows for more efficient and faster generations, as the data size is much smaller, but also
allows for working with different modalities. Since they are encoding the inputs, you can feed the model any kind of input, like images or text, and it will learn to encode these inputs into the same subspace that the diffusion model uses to generate an image. So yes, just like the CLIP model, one model will work with text or images to guide generations.

The overall model looks like this: you have your initial image, x, and encode it into an information-dense space called the latent space, z. This is very similar to a GAN, where you use an encoder model to take the image and extract the most relevant information about it in a subspace, which you can see as a downsampling task, reducing its size while keeping as much information as possible. You are now in the latent space with your condensed input. You then do the same thing with your conditioning inputs, either text, images, or anything else, and merge them with your current image representation using attention, which I described in another video. This attention mechanism learns the best way to combine the input and conditioning inputs in this latent space, adding attention, a transformer feature, to diffusion models. These merged inputs are now your initial noise for the diffusion process. Then you have the same diffusion model I covered in my Imagen video, but still in this subspace. Finally, you reconstruct the image using a decoder, which you can see as the reverse step of your initial encoder, taking this modified and denoised input in the latent space to construct a final high-resolution image, basically upsampling your results. And voilà! This is how you can use diffusion models for a wide variety of tasks like super-resolution, inpainting, and even text-to-image with the recent Stable
Diffusion open-source model, through the conditioning process, while being much more efficient and allowing you to run them on your own GPUs instead of requiring hundreds of them. You heard that right: for all the devs out there wanting to have their own text-to-image and image synthesis model running on their own GPUs, the code is available with pre-trained models. All the links are below. If you do use the model, please share your tests, results, or any feedback you have with me. I'd love to chat about that. Of course, this was just an overview of the latent diffusion model, and I invite you to read their great paper, linked below, to learn more about the model and the approach. Huge thanks to my friends at Qwak for sponsoring this video, and an even bigger thanks to you for watching the whole video. I will see you next week with another amazing paper!
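The encoder → conditioning → latent diffusion → decoder pipeline walked through above can be sketched structurally. This is a toy NumPy outline only: every function here is a hypothetical placeholder standing in for a learned network (the VAE encoder/decoder, cross-attention, and the denoising U-Net), not the CompVis implementation, and the 64×64 image / 8×8 latent sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image):
    """Encoder stand-in: downsample a 64x64 'image' into an 8x8 latent."""
    return image.reshape(8, 8, 8, 8).mean(axis=(1, 3))

def merge_condition(latent, cond):
    """Toy stand-in for cross-attention merging a conditioning embedding."""
    weights = np.tanh(cond)          # placeholder for learned attention weights
    return latent + weights * latent.mean()

def denoise_step(z, t):
    """Placeholder for one reverse-diffusion step of the U-Net."""
    return 0.98 * z                  # pretend we removed a bit of predicted noise

def decode(z):
    """Decoder stand-in: upsample the denoised latent back to image size."""
    return np.kron(z, np.ones((8, 8)))

image = rng.standard_normal((64, 64))
cond = rng.standard_normal((8, 8))   # e.g. a text embedding projected to latent shape

z = encode(image)                    # 1. compress into the latent space
z = merge_condition(z, cond)         # 2. merge the conditioning input via attention
for t in reversed(range(50)):        # 3. run the diffusion loop in the small latent
    z = denoise_step(z, t)
out = decode(z)                      # 4. reconstruct a full-resolution image
print(z.shape, out.shape)
```

The efficiency win the video emphasizes is step 3: the expensive iterative loop runs on the tiny 8×8 latent rather than the full-size image, which is why the real model fits on a single GPU instead of hundreds.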