🏗️ The Data Lake Is Not Broken; The Agreements Are

When a data project goes wrong, the first thing a company blames is the platform. Maybe it was a bad choice, maybe the technical team didn't know how to implement it, maybe the vendor didn't deliver what it promised. That cycle of blame is very common in Colombia, and it almost always points in the wrong direction.

The chaos in a data lake starts long before the first record reaches the system.

A data lake is a central repository where a company stores large volumes of data in its original state, before any processing. Think of it as the warehouse on a construction site, where bricks, pipes, cables, and wood arrive unlabeled and unsorted. The warehouse itself is not to blame if the building turns out badly. The mess starts with the purchase order.

When the sales department sends a file where the net sales column includes returns and the finance department sends another where it does not, both enter the system under the same field name. The analyst takes days to find the difference, and the report gets sent back. It is like bricks that arrive without a size specification: they come in just like the rest, and no one notices until they don't fit where they were supposed to go.
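To make the net sales mismatch concrete, here is a minimal sketch. The figures, the `net_sales` field name, and the `total_by_source` helper are all hypothetical, and it assumes "includes returns" means the returned amount has not been deducted. Both files load without complaint, and the gap only shows up when someone compares totals.

```python
# Hypothetical figures, invented for illustration only.
GROSS = 1_000   # invoiced amount for the period
RETURNS = 80    # value of returned goods

# Both files use the same field name, with different meanings.
sales_rows = [{"source": "sales", "net_sales": GROSS}]                # returns still counted
finance_rows = [{"source": "finance", "net_sales": GROSS - RETURNS}]  # returns deducted


def total_by_source(rows):
    """Sum the net_sales field per originating file."""
    totals = {}
    for row in rows:
        totals[row["source"]] = totals.get(row["source"], 0) + row["net_sales"]
    return totals


# Nothing stops both files from landing in the same "net_sales" column.
lake = sales_rows + finance_rows
totals = total_by_source(lake)
gap = totals["sales"] - totals["finance"]
print(totals, gap)
```

Nothing in the storage layer can flag this, because both values are valid numbers under a valid column name. The disagreement lives in the definition, not in the data.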

This rework happens in many Colombian companies with a regularity that nobody questions anymore. Reconciliation meetings become routine, manual checks don't add up, and someone always has their own version of the report. Excel spreadsheets are built to compensate for what the system should show on its own. The mess didn't start in the warehouse; it started with the purchase order.

Data governance, meaning the agreements and rules that define what information enters the system, who produces it, and what quality criteria it must meet, is almost always postponed. It is assumed to be a technical matter that the data team will sort out on its own. But by the time that team arrives, the mess is already inside, and untangling it costs far more than controlling it at the point of entry would have.

Putting things in order after everything has come in unchecked is like trying to sort materials when four floors have already been built. That time is taken away from projects, from decisions, and from trust in the reports. And that was exactly the problem the company set out to solve in the first place.

First, agree with your departments on what each key indicator means before the data leaves the source. Second, assign an owner to each data domain, someone who answers for the quality of its data and not just for producing it. Third, define minimum rules for entry into the system, such as mandatory fields and standard formats. Finally, review those agreements every time processes change, because new materials always keep arriving at the construction site.
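The minimum entry rules described above could look something like the sketch below. It is only an illustration under stated assumptions: `IngestionContract`, `validate`, the field names, the owner label, and the ISO date format are all hypothetical choices, not a prescribed standard. The idea is that each domain has a named owner and a short list of checks that run before a record is allowed in.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class IngestionContract:
    """Hypothetical agreement for one data domain."""
    domain: str
    owner: str                      # person accountable for the domain's quality
    required_fields: tuple          # fields that must be present and non-empty
    date_field: str                 # field that must follow the agreed format
    date_format: str = "%Y-%m-%d"   # assumed standard format


def validate(record: dict, contract: IngestionContract) -> list:
    """Return rule violations; an empty list means the record may enter."""
    problems = []
    for name in contract.required_fields:
        if record.get(name) in (None, ""):
            problems.append(f"missing required field: {name}")
    value = record.get(contract.date_field)
    if value not in (None, ""):
        try:
            datetime.strptime(value, contract.date_format)
        except (ValueError, TypeError):
            problems.append(f"bad date format in {contract.date_field}: {value!r}")
    return problems


# Illustrative contract and records; every value here is made up.
sales_contract = IngestionContract(
    domain="sales",
    owner="sales-operations-lead",
    required_fields=("invoice_id", "net_sales", "invoice_date"),
    date_field="invoice_date",
)
good = {"invoice_id": "A-1", "net_sales": 920, "invoice_date": "2024-03-01"}
bad = {"invoice_id": "A-2", "net_sales": None, "invoice_date": "01/03/2024"}

good_problems = validate(good, sales_contract)
bad_problems = validate(bad, sales_contract)
print(good_problems)
print(bad_problems)
```

A check like this does not settle what net sales means; that still takes a conversation between departments. It only makes sure the agreement, once reached, is enforced at the door instead of discovered four floors up.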