Big data jobs: who does what?

Driven by new storage and processing technologies such as in-memory computing, column-oriented databases and distributed programming models (MapReduce), big data has become increasingly relevant, especially in larger companies. Senior managers at large corporations, on both the business and the IT side, have to engage with this megatrend and evaluate how the new technological possibilities can best be used in their area of responsibility.
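To make the MapReduce programming model mentioned above more concrete, here is a minimal single-machine sketch in plain Python. It only illustrates the three phases (map, shuffle, reduce) with a word count; the function names are chosen for illustration and are not part of Hadoop or any other framework, and a real cluster would distribute each phase across many machines.

```python
from collections import defaultdict


def map_phase(line):
    # Emit one (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all emitted values by key, as the framework would do between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big jobs"])
```

Because each line is mapped independently and each key is reduced independently, both phases can in principle run in parallel, which is what makes the model attractive for large data volumes.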

  1. Big data
    Companies should be clear about what data they are collecting and what results they want to achieve. A big data solution should be able to capture as much of the data as possible, ideally all of it. Unlike with BI solutions, those responsible should not get lost in details but should keep the big picture in view.
  2. Big data
    The industry association BITKOM offers a free PDF file that can be used as a guide for big data projects.
  3. Big data
    With Hadoop and HDInsight in Microsoft Azure, you can also run big data workloads in the Microsoft cloud.
  4. Big data
    HDInsight is the fastest way to get started with Hadoop and big data. Microsoft provides developers with an offline test environment for HDInsight.
  5. Big data
    In most cases, big data solutions require a NoSQL database such as MongoDB in addition to existing databases.
  6. Big data
    Anyone who already works with big data and uses solutions in this area can extend their environment with further tools. Many open source products are available for this, for example Apache Giraph.
  7. Big data
    Microsoft also offers the free ebook “Introducing Microsoft Azure HDInsight”. It provides a good introduction to the possibilities of big data, HDInsight and Hadoop, including on other platforms.
  8. Big data
    HBase can serve as a database for big data solutions. It is modeled on Google's Bigtable and can store very large amounts of data.
  9. Big data
    Most companies rely mainly on Hadoop distributions or cloud solutions to process big data. Most of the tools and distributions are based on Apache projects. Apache Mahout, for example, adds machine learning algorithms to a Hadoop environment.
  10. Big data
    Cloud solutions on Microsoft Azure, Google Cloud Platform or Amazon Web Services are often billed by data volume and compute time. Developers should therefore build the shutdown and deallocation of big data environments into their queries and big data applications.
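The last point, building the shutdown of a billed environment into the application itself, can be sketched as follows. This is a minimal illustration under stated assumptions: `DemoCluster`, `running_cluster` and `run_job` are hypothetical stand-ins, not classes or functions of any cloud SDK. In a real project, `start` and `stop` would wrap the provider's own provisioning calls.

```python
from contextlib import contextmanager


class DemoCluster:
    """Hypothetical stand-in for a billed cloud cluster (not a real SDK class)."""

    def __init__(self, name):
        self.name = name
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


@contextmanager
def running_cluster(cluster):
    # Start the cluster and guarantee that it is stopped again,
    # even if the job raises an exception.
    cluster.start()
    try:
        yield cluster
    finally:
        cluster.stop()


def run_job(cluster, records):
    # Placeholder for the actual big data job; here it only sums numbers.
    if not cluster.running:
        raise RuntimeError("cluster is not running")
    return sum(records)


cluster = DemoCluster("demo")
with running_cluster(cluster) as active:
    total = run_job(active, [1, 2, 3])
```

Putting the teardown in a `finally` block means a failed query does not leave a paid environment running unnoticed.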

While companies used to rely almost exclusively on data from their own applications, many external sources such as social media and networked devices in the Internet of Things have been added in recent years. This is giving rise to new job profiles: the term "data scientist" has been appearing more and more often. The data scientist seems to be the kind of "wizard" every company needs to bring the marvel of big data to life. Like a many-headed hydra, the role seems to be the answer to every problem: something different for everyone, but always a fit. New degree programs are emerging that train students to become a "Master of Data Science", and not only since the Harvard Business Review named data scientist the "Sexiest Job of the 21st Century". But who is this hero of the present, whose job description is not all that new?

To offer some insight and a more nuanced picture, the following sections describe several terms and roles within companies that are often associated with the data science profession.

(Big) data engineer

The data engineer is essentially responsible for bringing data together. From the available data and technologies, he builds a landscape in which the data scientist can work. His knowledge is not limited to the data available in the company and where it is stored; he also knows how best to integrate this data into a central analysis infrastructure, which technologies are suitable for this, and which additional external data can be used for enrichment.

He becomes a big data engineer when he works with volumes of data that require big data technologies for storage and processing. The boundary of big data is not strictly defined, but large amounts of data can be, for example, one million sales transactions at an online retailer, one million phone calls handled by a telecommunications provider, or a sensor that produces 50 megabytes of data every two nanoseconds. His work begins with understanding the technical requirements and with planning and developing a robust and flexible big data infrastructure (in this capacity he is also known as a big data architect). It continues with connecting internal and external data sources via batch, real-time and streaming interfaces, and extends to ensuring smooth operation and up-to-date data. He is basically the stadium architect, greenkeeper and kit manager for the soccer team. The (big) data engineer is the master of the data supply.

Management Scientist

The management scientist, by contrast, is more like the manager or head coach, to stay with the soccer team metaphor. He is the first on site, analyzes the situation and discusses the business problems that are to be solved with the help of data analyses. With the growing popularity of data-driven decision support, there is hardly a business function or industry today in which data analysis is not used.

The management scientist's contribution is to translate the language of the business specialist, who is not versed in technology or data, into that of the data scientist. This starts with specifying the actual business problem and translating it into a sharply defined analytical question; it continues with identifying the required data, managing the operational analysis, and communicating analytical results and recommendations for action. For this job, the management scientist needs a good understanding of analytical methods and procedures as well as of business processes and their effects. He needs enough understanding of the specialist areas to follow the business representative and explain the problem to the data scientist, as well as the ability to evaluate analytical results and to present the procedure and results to the business representative in that person's own language. The management scientist is the mediator between two worlds.