About

More and more organizations are establishing so-called data lakes, which combine all internally available data sources in a common storage and processing environment. In practice, data lakes often lack a structured curation process: neither global schemas nor shared vocabularies are established, and data sources usually remain disparate because no associations between them are defined.

To make sense of the data, these curation steps must be applied implicitly at query time. Several tools enable querying relational and semi-structured data sources using SQL. The resulting SQL query logs constitute a dynamic form of documentation for the data lake: they contain knowledge about the purpose of data sources, their semantics, vocabularies, associations with other data sources, and their temporal and social usage context.
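
As a rough illustration of the kind of knowledge a query log can carry, the following Python sketch mines equi-join predicates from a handful of logged SQL statements and counts how often, and by whom, two columns are joined. It is not OCEAN's actual extraction method: the log format, the table and column names, and the helper functions `join_associations` and `mine_log` are hypothetical, and the regular expressions are deliberately naive (no subqueries, quoted identifiers, or schema-qualified names).

```python
import re
from collections import Counter

# Hypothetical log entries: (user, SQL text). All names are made up.
QUERY_LOG = [
    ("alice", "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id"),
    ("bob", "SELECT c.name FROM customers c JOIN orders o "
            "ON c.id = o.customer_id WHERE o.total > 100"),
    ("carol", "SELECT * FROM shipments s JOIN orders o ON s.order_id = o.id"),
]

# "FROM table [AS] alias" or "JOIN table [AS] alias"; the lookahead keeps
# SQL keywords from being mistaken for an alias.
TABLE_RE = re.compile(
    r"\b(?:FROM|JOIN)\s+(\w+)"
    r"(?:\s+(?:AS\s+)?"
    r"(?!(?:JOIN|ON|WHERE|INNER|LEFT|RIGHT|FULL|CROSS|GROUP|ORDER|LIMIT)\b)(\w+))?",
    re.IGNORECASE,
)
# Equi-join predicates of the form qualifier.column = qualifier.column.
EQ_RE = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")


def join_associations(sql):
    """Return the column pairs that one query joins across different tables."""
    aliases = {}
    for table, alias in TABLE_RE.findall(sql):
        # A table without an alias is referenced by its own name.
        aliases[(alias or table).lower()] = table.lower()
    pairs = []
    for a1, c1, a2, c2 in EQ_RE.findall(sql):
        left, right = aliases.get(a1.lower()), aliases.get(a2.lower())
        if left and right and left != right:
            # Sort so that "a = b" and "b = a" yield the same association.
            pairs.append(tuple(sorted([f"{left}.{c1.lower()}",
                                       f"{right}.{c2.lower()}"])))
    return pairs


def mine_log(log):
    """Count each association and record which users relied on it."""
    counts, users = Counter(), {}
    for user, sql in log:
        for pair in join_associations(sql):
            counts[pair] += 1
            users.setdefault(pair, set()).add(user)
    return counts, users


if __name__ == "__main__":
    counts, users = mine_log(QUERY_LOG)
    for pair, n in counts.most_common():
        print(pair, n, sorted(users[pair]))
```

A production system would presumably use a real SQL parser rather than regular expressions; the point here is only that associations which were never declared in the data lake can still surface from how analysts actually join the data.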

Such knowledge is shared informally within a specific team of data scientists, but it is usually neither formalized nor shared with other teams, so potential synergies across an organization remain unused. Hence, we introduce OCEAN, a novel approach that extends existing data management systems with additional capabilities for knowledge sharing. OCEAN facilitates user collaboration without altering established data analysis workflows or abandoning existing BI tools. It extracts relevant knowledge from the query log to support data source discovery and incremental data integration, and it formalizes this knowledge so that it can be shared across different teams of data scientists.