Are Visual Tokens More Effective Than Text Tokens for Language Tasks?

It’s in the news: we have a new DeepSeek moment. A bit less fancy, maybe, but quite a breakthrough that may redefine LLMs. In its most recent paper, the whale presents DeepSeek-OCR, a powerful new OCR (Optical Character Recognition) model.

While its OCR capabilities are impressive, the real intrigue for us lies in a more fundamental question it prompts.

As a community, we’ve defaulted to text tokens as the universal input for LLMs. But what if this is a wasteful, limiting bottleneck? What if pixels are a more effective and natural input stream?

The Case for a Pixel-Only LLM Diet

Imagine an LLM that only ever consumes images. Even pure text would be rendered into an image before being processed. This might sound counterintuitive, but the potential benefits are compelling:

1. Superior Information Compression

The DeepSeek-OCR paper hints at this: visual information can be more densely packed. A single image can convey the semantic meaning of a paragraph of text, along with its visual presentation. This could lead to significantly shorter effective context windows, translating directly into higher speed and lower computational cost.
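To make the claim concrete, here is a back-of-envelope sketch. The numbers are illustrative assumptions (roughly four characters per BPE token, a fixed budget of 256 latent vision tokens per rendered page), not figures from the DeepSeek-OCR paper:

```python
# Back-of-envelope sketch of the compression argument. All numbers here are
# illustrative assumptions: we posit a vision encoder that emits a fixed
# budget of latent tokens per rendered page, regardless of how much text
# the page contains.
def compression_ratio(chars_on_page: int,
                      chars_per_text_token: float = 4.0,
                      vision_tokens_per_page: int = 256) -> float:
    """Ratio of text tokens to vision tokens for one rendered page."""
    text_tokens = chars_on_page / chars_per_text_token
    return text_tokens / vision_tokens_per_page

# A dense page of prose is roughly 3,000-8,000 characters.
for chars in (3_000, 5_000, 8_000):
    print(f"{chars} chars -> ~{compression_ratio(chars):.1f}x fewer tokens as pixels")
```

The denser the page, the better the ratio, which is exactly the regime where long-context costs hurt the most.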

2. A Radically More General Input Stream

Text tokens are a narrow, impoverished data type. By switching to pixels, your model’s “language” becomes the universal language of visual information. The input can be:

  • Formatted Text: Bold, italics, colors, and font sizes—all of which carry meaning—are naturally understood.
  • Mixed Modalities: Charts, diagrams, memes, and real-world photographs are processed natively alongside text. No more complex, multi-modal pipelines; everything is just an image, as sketched below.
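If everything is just an image, ingesting plain text reduces to rendering it. A minimal sketch, assuming Pillow and NumPy are available (the canvas size, font, and colors are arbitrary choices):

```python
# Minimal sketch: render a text snippet, styling included, into a pixel array
# that a vision encoder could consume. Pillow is assumed to be installed;
# the canvas size and font are illustrative placeholders.
from PIL import Image, ImageDraw, ImageFont
import numpy as np

def render_text(text: str, width: int = 640, height: int = 160) -> np.ndarray:
    """Rasterize text onto a white canvas and return an HxWx3 uint8 array."""
    img = Image.new("RGB", (width, height), color="white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()   # stand-in for a real typeface
    draw.text((8, 8), text, fill="black", font=font)
    draw.text((8, 40), "Warnings can simply be rendered in red.",
              fill="red", font=font)  # color carries meaning at no extra cost
    return np.asarray(img)

pixels = render_text("Bold, italics, and layout survive rasterization.")
print(pixels.shape, pixels.dtype)     # (160, 640, 3) uint8
```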

3. The Power of Bidirectional Attention by Default

Text generation is autoregressive—we predict the next token from left to right. But understanding text doesn’t have to be. When you process an image, there’s no inherent sequential order. The model can use bidirectional attention across the entire “scene” from the start, leading to a richer and more powerful understanding of the input context.
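For concreteness, here is a minimal sketch of the two attention patterns in plain NumPy (the sequence length is arbitrary):

```python
# Minimal sketch of the two attention patterns.
# A causal mask lets position i attend only to positions <= i; a bidirectional
# mask over image patches lets every position attend to every other one.
import numpy as np

n = 6  # six text tokens or six image patches, purely illustrative

causal_mask = np.tril(np.ones((n, n), dtype=bool))   # autoregressive decoding
bidirectional_mask = np.ones((n, n), dtype=bool)     # whole-image encoding

print("causal (autoregressive) mask:\n", causal_mask.astype(int))
print("bidirectional (full) mask:\n", bidirectional_mask.astype(int))
# Each True entry marks a query-key pair the attention layer may use;
# the full mask gives the encoder the entire "scene" at once.
```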

The Tokenizer Must Go

This is a hill we’re prepared to defend: the tokenizer is a problematic relic.

It’s an ugly, separate preprocessing stage that breaks end-to-end learning. It imports all the historical baggage of Unicode, byte encodings, and security vulnerabilities (continuation byte attacks, anyone?). Most damningly, it severs the model from the real world.

Two characters that look identical to the human eye can be mapped to two completely different, unrelated tokens. A smiling emoji becomes a weird, abstract token, not a visual representation of a smile that can benefit from the model’s inherent understanding of faces and emotions.
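This is easy to demonstrate. A minimal sketch, assuming the tiktoken package and its cl100k_base vocabulary; other BPE tokenizers show the same effect:

```python
# Two visually near-identical characters map to unrelated token IDs.
# Assumes the `tiktoken` package; any BPE tokenizer exhibits the same behavior.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

latin_a = "a"          # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A, visually identical

print(enc.encode(latin_a))     # a single, familiar token ID
print(enc.encode(cyrillic_a))  # a different, unrelated token sequence

# To a pixel-based model these two inputs are essentially the same glyph;
# to a BPE tokenizer they share nothing at all.
```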

The tokenizer creates a brittle, artificial layer between the user and the model. Replacing it with a direct pixel-based interface would be a monumental step towards more robust and intuitive AI.

The Practical Path Forward

So, what would this look like in practice? The most viable architecture today is a Visual Encoder + LLM Decoder.

The user’s message—whether it’s a screenshot of a document, a photo of a whiteboard, or a rendered snippet of text—is fed in as pixels. The visual encoder processes this into a latent representation that the LLM decoder then uses to generate a coherent text response.
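As a structural sketch only, here is what that pipeline can look like in PyTorch. The patch size, dimensions, and layer counts are illustrative assumptions, not the DeepSeek-OCR architecture:

```python
# Structural sketch of a "visual encoder + LLM decoder" frontend in PyTorch.
# Every dimension and module is an illustrative assumption: a tiny ViT-style
# patch encoder plus a linear projector into the decoder's embedding space.
import torch
import torch.nn as nn

class PixelFrontend(nn.Module):
    def __init__(self, patch=16, enc_dim=256, llm_dim=1024):
        super().__init__()
        # Patchify: one "visual token" per 16x16 pixel patch.
        self.patchify = nn.Conv2d(3, enc_dim, kernel_size=patch, stride=patch)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Projector: map visual latents into the LLM decoder's embedding space.
        self.project = nn.Linear(enc_dim, llm_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.patchify(images)         # (B, enc_dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, enc_dim)
        x = self.encoder(x)               # bidirectional attention over patches
        return self.project(x)            # (B, num_patches, llm_dim)

frontend = PixelFrontend()
page = torch.rand(1, 3, 224, 224)         # one rendered "page" as pixels
visual_prefix = frontend(page)
print(visual_prefix.shape)                # torch.Size([1, 196, 1024])
```

The decoder itself is untouched; it simply receives the `visual_prefix` sequence where it would normally receive text-token embeddings.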

This approach elegantly sidesteps the most significant technical hurdle: generating high-fidelity, coherent images as output. For the vast majority of practical applications, we want a textual assistant, not a paintbrush. We keep the powerful text decoder we’ve spent years perfecting and simply give it a better set of eyes.

Conclusion: A Vision-Centric Future for LLMs?

OCR is just one application in a much broader paradigm. By framing all inputs as visual tasks, we unlock a world of generality and efficiency. While the text-only tokenizer has brought us far, it may be holding us back from the next leap in AI capability.

The question isn’t just whether visual tokens are more effective for some language tasks—it’s whether they are the universally better foundation for building general-purpose, multimodal reasoning systems.

The future of LLM inputs might not be text at all—it might be a stream of pixels.