How tech giants like Google, Facebook, and Amazon store the world’s data

Sakshi Tripathi
8 min read · Sep 17, 2020


So let’s start the analysis, but before that, let me set up a scenario to understand this concept even better…

On the evening of April 11, 2012, Mike Krieger, the co-founder of Instagram, gave a talk in San Francisco. What he must have known, though his audience didn’t, was that in less than 24 hours, Mark Zuckerberg would announce that he was buying Instagram for a billion dollars.

Given that he was about to become A Very Rich Man, Krieger hid his excitement well. The talk, called “Scaling Instagram”, was a long, technical one about the challenges of growing the popular photo app. One of the final slides (number 176 of 185) had just two words: “Unprecedented Times”. The next one said: “2 backend engineers can scale a system to 30+ million users.” On the eve of the Facebook deal, Instagram had bumped that total up to five engineers.

If this doesn’t sound quite as remarkable as it should, it’s because we are now inured to Silicon Valley startup stories that begin in garages with a couple of geeks who end up making millions, or even a billion.

Look at this ‘growth’ in a different way, then: it’s as if Steve Jobs and Steve Wozniak built the first Apple computer and, within a day or two of building that first prototype, were able to ship several thousand units to customers.

You might retort, of course, that software is different. You can ‘scale’ bits and bytes in a way you can’t with metal or glass or plastic. Fair enough 🤔

So imagine this scenario:

You create a piece of software that allows millions of users to upload millions of photographs, tweak them, tag them, and share them. You need somewhere to store those photographs, and you need to be able to handle thousands or millions of users swarming over your website or app every day without it going down.

In a world where competition is intense, users will simply dump you if your app slows down or freezes, so your ‘downtime’ has to be pretty much zero. Whether your users are in New York, Tokyo, Ankara, or Mumbai, you have to be always up and always running, 24 hours a day.

How do you compete in this fast-moving era?

How do you manage such a huge amount of data?

Big Data

Big data is generated by electronic operations from multiple sources, and it requires serious processing power and analytical capability. Its importance lies in the analysis, which helps organizations make informed decisions and provide better, faster services. The term “big data” refers to huge volumes of fast-moving data of many different types, data that cannot be processed and stored on ordinary computers. Its main characteristics, called the V’s, can be summed up in the fact that the issue is not only about the volume of data.

Volume:

Volume represents the amount of data produced from multiple sources, now measured in zettabytes. Volume is the most evident dimension of big data.

Variety:

Variety represents the types of data. With the growing number of Internet, smartphone, and social network users everywhere, the familiar form of data has shifted from structured data in databases to unstructured data in a large number of formats, such as images, audio and video clips, SMS, and GPS data.

Velocity:

Velocity represents the speed at which data arrives from different sources, that is, the rate at which data is produced on platforms such as Twitter and Facebook. The huge increase in data volume and frequency demands a system capable of super-fast data analysis.

Veracity:

Veracity represents the quality of the data: its accuracy and the confidence we can place in its content. The quality of the data captured can vary greatly, which affects the accuracy of analysis. Although there is wide agreement on the potential value of big data, that data is almost worthless if it is not accurate.

Difference between traditional data and big data

In short, traditional data is typically structured, measured in gigabytes, and managed in a centralized relational database, while big data is largely semi-structured or unstructured, measured in terabytes and petabytes, and stored and processed across distributed clusters of machines.

The first technology we can use for the management of big data is:

Managing Big Data with Cloud Computing

Cloud computing refers to on-demand computing resources and systems that provide a range of integrated services without being bound by local resources, making them easy for users to access. These resources include data storage, backup, and self-synchronization, as well as software processing and task scheduling. Cloud computing is a shared-resource model that can offer a variety of online services such as virtual server storage, applications, and licensing for desktop applications. By pooling common resources, cloud computing can expand and provide capacity on demand.
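As a rough illustration of the “store millions of photographs somewhere” problem from earlier, here is a minimal sketch of putting uploads into a cloud object store. It assumes an AWS S3 bucket and the boto3 library; the bucket name, paths, and helper functions are hypothetical and not from the original article.

```python
import boto3

# Assumes AWS credentials are configured and the bucket already exists.
s3 = boto3.client("s3")
BUCKET = "my-photo-app-uploads"  # hypothetical bucket name


def store_photo(local_path: str, user_id: str, filename: str) -> str:
    """Upload a user's photo to object storage and return its key."""
    key = f"photos/{user_id}/{filename}"
    s3.upload_file(local_path, BUCKET, key)
    return key


def list_user_photos(user_id: str) -> list:
    """List all photo keys stored for a given user."""
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"photos/{user_id}/")
    return [obj["Key"] for obj in response.get("Contents", [])]


if __name__ == "__main__":
    store_photo("sunset.jpg", "user42", "sunset.jpg")
    print(list_user_photos("user42"))
```

The point is the on-demand part: the application never has to know how many disks sit behind the bucket, and capacity grows without any change to this code.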

THE RELATIONSHIP BETWEEN THE CLOUD AND BIG DATA

Cloud computing is a major trend in technology, and the growth of technology has driven the rapid development of the electronic information society. That growth produces the phenomenon of big data, and the rapid increase in big data is itself a challenge for that society. Cloud computing and big data go together: big data needs the storage capacity of the cloud, and cloud computing supplies huge computing and storage resources. By demanding that computing capability, big data applications stimulate and accelerate the development of cloud computing, while the cloud’s distributed storage technology helps manage big data.

☁ “The cloud has enabled us to be more efficient, to try out new experiments at a very low cost, and enabled us to grow the site very dramatically while maintaining a very small team,” ☁

Pinterest operations engineer Ryan Park told a conference in New York.

Big data processing and storage need to scale, and the cloud provides that scale through virtual machines, helping big data evolve and become accessible; the relationship works in both directions. Google, IBM, Amazon, and Microsoft are examples of companies successfully using big data in the cloud environment. For the cloud environment to fit big data, however, it must be adapted so that data and cloud work well together; many changes are needed, from CPUs capable of handling big data workloads onward.

The second technology we can use for the management of big data is:

Managing Big Data with Hadoop

Hadoop, an open-source software framework, uses HDFS (the Hadoop Distributed File System) and MapReduce to analyze big data on clusters of commodity hardware — that is, in a distributed computing environment.

Initially designed in 2006, Hadoop is a remarkable piece of software, particularly well suited to managing and analyzing big data in both structured and unstructured forms. Its creators built an open-source technology drawing on earlier work, including technical papers written by Google. The initial design has undergone several revisions to become the go-to data management tool it is today.

As organizations began to use the tool, they also contributed to its development. The first organization to apply it was Yahoo; other Internet companies followed suit shortly after. Facebook, Twitter, and LinkedIn also applied Hadoop in their operations, and all of them contributed to the development of the tool.

Today, Hadoop is a framework comprising tools and components offered by a range of vendors. The wide variety of tools and components that make up Hadoop are extensions of the basic framework.

How Hadoop handles big data…

The Hadoop Distributed File System, as the name suggests, is the component responsible for distributing data across the cluster’s storage nodes, the DataNodes. It also maintains the file system directory that keeps track of where each piece of data is stored within the nodes.
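To get a feel for what this looks like in practice, here is a minimal sketch of loading files into HDFS from Python. It assumes the third-party hdfs (WebHDFS) client package and a hypothetical NameNode address; the hostnames and paths are placeholders, not taken from the article.

```python
from hdfs import InsecureClient  # pip install hdfs; needs WebHDFS enabled

# Hypothetical NameNode web address for the cluster.
client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Create a directory in HDFS and upload a local folder of photos into it.
client.makedirs("/user/hadoop/photos")
client.upload("/user/hadoop/photos", "local_photos/")

# List the files HDFS now tracks; the data blocks themselves live on the DataNodes.
print(client.list("/user/hadoop/photos"))
```

Behind these few calls, HDFS splits each file into blocks, replicates them across DataNodes, and records where everything lives in its directory.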

Applications run concurrently on the Hadoop framework; the YARN component is in charge of ensuring that resources are appropriately distributed to running applications and of scheduling the jobs that run concurrently.

The MapReduce component of the Hadoop toolset directs the order of batch applications and is in charge of their parallel execution. The Hadoop Common component serves as a shared resource used by the other components of the framework. Hadoop Ozone provides object storage, while Hadoop Submarine drives machine learning; both are among the newest additions to Hadoop.
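To make MapReduce more concrete, here is a classic word-count sketch written for Hadoop Streaming, which lets the mapper and reducer be plain Python scripts that read stdin and write stdout. The file names are assumptions, and the exact submission command depends on your installation.

```python
# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py: sum the counts per word (input arrives sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this is typically submitted with the Hadoop Streaming jar, and YARN then allocates containers so that many copies of the mapper and reducer run in parallel across the cluster.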

Thus, Hadoop processes a large volume of data with its different components performing their roles within an environment that provides the supporting structure. Apart from the components mentioned above, a Hadoop stack can also include other tools, such as the Apache HBase database management system and tools for data management, development, and application execution. The tools an organization applies on the Hadoop framework depend on the needs of that organization.
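As one example of such a stack tool, the sketch below writes and reads a row in HBase through its Thrift gateway using the third-party happybase package. The host, table, and column names are hypothetical, and the “photos” table with these column families is assumed to exist already.

```python
import happybase  # pip install happybase; requires the HBase Thrift server

# Hypothetical Thrift server address for the cluster.
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("photos")

# Store a row keyed by photo id, with column families "meta" and "stats".
table.put(b"photo:0001", {
    b"meta:owner": b"user42",
    b"meta:filename": b"sunset.jpg",
    b"stats:likes": b"17",
})

# Read the row back.
row = table.row(b"photo:0001")
print(row[b"meta:owner"], row[b"stats:likes"])
```

HBase stores these rows on top of HDFS, which is what makes it a natural database for data that already lives in a Hadoop cluster.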

Hadoop is just one element of the continuing revolution in data management, just as G-Drive and Dropbox and others represent the consumer side. Expect more radical innovations and more tricky questions.

So that’s all the material; I hope this blog helps you understand the concept of big data and its management.

THANK YOU !!!!

