Future Forecast 002

OpenAI announces Sora for text-to-video, come see how it will change everything

In this edition of Future Forecast we focus on the latest announcement from OpenAI: Sora. We’ll walk you through what Sora is, show examples of how it’s been used so far, and make some predictions about what the future holds for this technology.

Embark with us on an explorative expedition into the future of technology, where we navigate the possibilities and tackle the emerging challenges together.

OpenAI’s Sora

Okay… but what is it?

Sora is a diffusion model with a transformer architecture, which means its neural network functions in a similar way to ChatGPT’s [1]. Diffusion models generate high-quality, detailed images by gradually transforming a random noise distribution into a coherent image through a reverse diffusion process. Imagine you have a messy, scribbly picture. A diffusion model is like a magic eraser that slowly cleans up the mess, step by step, until you get a really clear and pretty picture at the end.
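
To make that "magic eraser" concrete, here's a minimal toy sketch of the reverse-diffusion loop in Python. The `denoise_step` function is a hypothetical stand-in for a trained neural network that estimates the noise in its input; real diffusion models use a learned network and a carefully tuned noise schedule, not the toy values here.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x: np.ndarray, t: int) -> np.ndarray:
    """Hypothetical stand-in for a trained network that predicts the noise in x at step t."""
    return 0.1 * x  # toy assumption: pretend a tenth of x is noise

def sample(shape: tuple, steps: int = 50) -> np.ndarray:
    """Start from pure Gaussian noise and remove a little predicted noise each step."""
    x = rng.normal(size=shape)      # the "messy, scribbly picture"
    for t in reversed(range(steps)):
        x = x - denoise_step(x, t)  # the "magic eraser", one step at a time
    return x                        # a (toy) coherent sample

image = sample((64, 64))
print(image.shape)  # (64, 64)
```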

Transformer architecture, on the other hand, is known for its effectiveness in understanding complex data relationships, especially in natural language processing, due to its self-attention mechanism. Think of the transformer architecture like a really smart detective who is great at solving puzzles. It looks at all the messy, scribbly lines and figures out how they are connected to make sense of the whole picture, just like figuring out the clues in a mystery. It pays special attention to the important parts, so it knows which pieces of the puzzle to put together first.
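
For the curious, here's a minimal NumPy sketch of that self-attention mechanism: each token scores every other token, and those scores decide which "clues" get blended together. The weight matrices and dimensions are toy values for illustration, not anything from Sora itself.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])         # how strongly each token "looks at" the others
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row of weights sums to 1
    return weights @ v                              # blend the values by attention

rng = np.random.default_rng(0)
seq_len, dim = 5, 8                                 # 5 tokens, each an 8-number embedding
x = rng.normal(size=(seq_len, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)       # (5, 8)
```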

By integrating these two, the model can efficiently generate or process complex data, like images, text, or, in this case, short video clips, with a deeper understanding of context and detail.
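
The key trick that makes video work with a transformer, per OpenAI's technical report, is cutting clips into "spacetime patches," so a video becomes a sequence of tokens much like words in a sentence. Here's a toy sketch of that patchify step; the dimensions are illustrative, and the real system operates on compressed latent representations rather than raw pixels like this.

```python
import numpy as np

def patchify(video: np.ndarray, p: int = 4) -> np.ndarray:
    """Turn a (frames, height, width) clip into a sequence of p*p*p spacetime patches."""
    f, h, w = video.shape
    assert f % p == 0 and h % p == 0 and w % p == 0
    blocks = video.reshape(f // p, p, h // p, p, w // p, p)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)  # group the three patch axes together
    return blocks.reshape(-1, p * p * p)         # (num_patches, patch_dim)

clip = np.zeros((16, 64, 64))   # 16 frames of 64x64 "video"
tokens = patchify(clip)
print(tokens.shape)             # (1024, 64): 1024 patch tokens, each 64 numbers
```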

There has been a lot of speculation since the announcement about how the videos come out looking so realistic. Some think Sora is creating real-world simulations in Unreal Engine, while others think it has an expert understanding of physics. Well, it turns out neither is true. For a full breakdown of how Sora actually works, check out this post on Medium by Mike Young. He read OpenAI’s research post for us, explains it in simpler terms, and lays out why Sora is a big deal.

“Unlike static images, video inherently involves representing change over time, 3D spaces, physical interactions, continuity of objects, and much more. Past video generation models have struggled to handle diverse video durations, resolutions, and camera angles…these systems lack the intrinsic “understanding” of physics, causality, and object permanence needed to produce high-fidelity simulations of reality.”

Sora’s not your average AI. If it were, OpenAI might have released it to the public when they made their announcement. But unfortunately it was just a tease. The company is letting a select few test it out, see what it creates, and surface issues that shouldn’t make it out to the masses.

This is where text-to-video was a year ago, with the famous Will Smith eating spaghetti example (it’s a little disturbing compared to what Sora can do).

Now that that image is in your head, let’s check out some of the examples of what people have been creating with Sora. It’s wild to see a single text prompt generate something of this magnitude compared to a year ago with the spaghetti monstrosity.

This is how Mike Young describes the difference from previous iterations of text-to-video technology:

“The videos released from OpenAI qualitatively show a model that performs better than anything we’ve seen in these areas. The videos, frankly, look real. For example, a person’s head will occlude a sign and then move past it, and the text on the sign will remain as before. Animals will shift wings realistically even when “idle.” Petals in the wind will follow the breeze. Most video models fail at this kind of challenge, and the result tends to be some flickery, jittery mess that the viewer’s mind has to struggle to make coherent.”

These videos (especially the woman walking through the city) are incredibly impressive. It’s hard to tell that they were generated by AI rather than filmed in the real world. A year ago, with the spaghetti monster fresh in mind, it would’ve seemed impossible to generate the reflections off sunglasses in a city environment at such high fidelity. But here we are, on the brink of another breakthrough in artificial intelligence that will shape society and the internet from this day forward. Well, from the day it’s released, really.

The Future

Once this is available to the public, what do you think is going to happen? What happens to the content creators who spend hours and hours shooting and editing videos to get the perfect shot that helps them garner millions of views? What happens when you can cut the time it takes to make high-quality content by 75%, 90%, or 99%?

Like we mentioned in last week’s Future Forecast, by 2030 life is really going to change. This breakthrough will have a huge impact on the internet and, in turn, how we live our daily lives. I was skeptical about how much AI was going to “take our jobs,” but after seeing this I can truly see how people will be put out of work by this technology. That doesn’t mean no one will make content the old-fashioned way, or that every person making video will be replaced. But I do think someone using this technology is going to replace ten people who aren’t. Keep up or be left behind, as it were.

AI itself is not ready to replace us; at this point it can only help us. So those who stay ahead of the curve and ride the wave of advancement into the future are going to be the beneficiaries of this change. We have our surfboards ready and plan to paddle out as soon as the latest technology is available.

At Vision Quest there is immense potential to leverage Sora, combined with other technologies like ElevenLabs, to produce high-quality content for our subscribers. Check out this example that you wouldn’t believe is fully made by AI.

Will we become filmmakers at Vision Quest? Perhaps… but most likely we’ll use it to create content that complements our newsletter editions and can be shared on other platforms like YouTube, Instagram, TikTok, Facebook, X (Twitter), etc. It lets us, without much effort or prior experience, produce something that used to take years of expertise, equipment, and either on-location shooting or extensive 3D environment creation and editing. The more I type this, the more groundbreaking this advancement feels.

There are so many people out there already using AI-generated images for their blogs, news articles, video thumbnails, Instagram posts, and more that I can’t imagine how many AI-generated videos are going to take over the internet once this is in the hands of everyday people. And it’s not just individuals; businesses are getting up to speed with this technology too. How will it change commercials, TV shows, and movies? We already went through a strike in Hollywood where actors and writers were fighting against image- and text-based AI. I wonder if they were also aware that this technology was on its way to create video so easily.

All in all, this technology will make waves. Whether it’s in the hands of everyday people or giant corporations, once Sora is fully available it’ll be flooding the internet with videos in no time. Time will tell how else this technology will be used, or where it will take society.

If you made it alllllll the way down here, we appreciate you! Let us know if you liked this edition of Future Forecast and if you’d read it again.

