Big Data for Dummies Flashcards
(36 cards)
What are the three Vs of big data?
Extremely large volumes of data, extremely high velocity of data, and extremely wide variety of data.
Why is big data important?
It enables organizations to gather, store, manage, and manipulate vast amounts of data at the right speed, at the right time, to gain the right insights.
Data warehouses vs. data marts
Data warehouses could be too complex and large, and they didn’t offer the speed and agility that the business required. The answer was a further refinement of the data being managed, through data marts. Data marts focused on specific business issues and were more streamlined, supporting the business need for speedy queries.
Data warehouses are typically fed in…
Batch intervals, such as daily or weekly. This limits their usefulness in real-time business and consumer environments.
What is a BLOB?
Binary large object: it stores an unstructured data element. An ODBMS (object database management system) stores the BLOB as an addressable set of pieces so that we can see what is in there.
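The "addressable set of pieces" idea can be sketched as follows. This is only an illustration, assuming a fixed piece size and a plain dictionary as the store; the names PIECE_SIZE, split_blob, and join_blob are hypothetical and do not come from any particular object database.

```python
PIECE_SIZE = 4  # hypothetical piece size in bytes, kept tiny for illustration


def split_blob(blob: bytes, piece_size: int = PIECE_SIZE) -> dict[int, bytes]:
    """Split a BLOB into pieces keyed by their position in the original."""
    return {
        index: blob[offset:offset + piece_size]
        for index, offset in enumerate(range(0, len(blob), piece_size))
    }


def join_blob(pieces: dict[int, bytes]) -> bytes:
    """Reassemble the original BLOB from its addressed pieces."""
    return b"".join(pieces[i] for i in sorted(pieces))


pieces = split_blob(b"unstructured")
# Any single piece can now be fetched by its address, e.g. pieces[1].
```

Because each piece has its own address, one part of the object can be read without loading the whole thing.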
What is the advantage of an object database?
Includes a programming language and a structure for the data elements so that it is easier to manipulate various data objects without programming and complex joins.
What are some of the technologies at the heart of big data? (4)
1. Virtualization 2. Parallel processing 3. Distributed file systems 4. In-memory databases
Different approaches to handling data exist based on whether it is data in motion or data at rest. What is data in motion vs. data at rest?
Data in motion would be used if a company is able to analyze the quality of its products during the manufacturing process to avoid costly errors. Data at rest would be used by a business analyst to better understand customers’ current buying patterns.
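A minimal sketch of the two cases, assuming made-up sensor readings and purchase records; the function names flag_defects and top_product are hypothetical. The first function inspects values one at a time as they arrive, and the second queries data that has already been stored.

```python
from collections import Counter
from typing import Iterable, Iterator


def flag_defects(readings: Iterable[float], limit: float) -> Iterator[float]:
    """Data in motion: check each reading as it arrives, without storing it."""
    for value in readings:
        if value > limit:
            yield value


def top_product(purchases: list[str]) -> str:
    """Data at rest: query a stored purchase history after the fact."""
    return Counter(purchases).most_common(1)[0][0]


sensor_feed = iter([9.8, 10.1, 12.7, 9.9])  # made-up measurements
alerts = list(flag_defects(sensor_feed, limit=12.0))

stored_purchases = ["tea", "coffee", "tea"]  # made-up purchase history
best_seller = top_product(stored_purchases)
```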
Is big data a single technology? What does it help companies gain?
Big data is a combo of old and new technologies that helps companies gain actionable insight.
What are the 5 components of the cycle of big data management?
1. Capture 2. Organize 3. Integrate 4. Analyze 5. Act
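The five stages can be sketched as a small pipeline. This is only an illustration: the records, the spend threshold, and the "loyalty offer" action are all invented, and real systems would put a separate tool behind each stage.

```python
def capture() -> list[dict]:
    # Made-up raw records from two hypothetical sources.
    return [
        {"source": "web", "customer": " Ana ", "spend": "20"},
        {"source": "store", "customer": "ana", "spend": "35"},
        {"source": "web", "customer": "Bo", "spend": "5"},
    ]


def organize(records: list[dict]) -> list[dict]:
    # Normalize names and convert spend to a number.
    return [
        {"customer": r["customer"].strip().lower(), "spend": float(r["spend"])}
        for r in records
    ]


def integrate(records: list[dict]) -> dict[str, float]:
    # Combine records from all sources into one total per customer.
    totals: dict[str, float] = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["spend"]
    return totals


def analyze(totals: dict[str, float], threshold: float = 50.0) -> list[str]:
    # Find customers whose combined spend reaches the (assumed) threshold.
    return [customer for customer, total in totals.items() if total >= threshold]


def act(customers: list[str]) -> list[str]:
    # Turn the insight into a concrete business action.
    return [f"send loyalty offer to {c}" for c in customers]


actions = act(analyze(integrate(organize(capture()))))
```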
Why is validation an important issue in big data management?
If your organization is combining data sources, it is critical that you have the ability to validate that these sources make sense when combined. Also, certain data sources may contain sensitive information, so you must implement sufficient levels of security and governance.
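One way to check that combined sources "make sense" is to test the join before trusting it. The sketch below assumes two hypothetical sources, a CRM table and an order feed, and two invented rules (every order must reference a known customer, and amounts must not be negative).

```python
def validate_merge(crm: dict[str, dict], orders: list[dict]) -> list[str]:
    """Return a list of problems found when joining orders to CRM records."""
    problems = []
    for order in orders:
        customer_id = order.get("customer_id")
        if customer_id not in crm:
            problems.append(
                f"order {order['order_id']}: unknown customer {customer_id}"
            )
        elif order["amount"] < 0:
            problems.append(f"order {order['order_id']}: negative amount")
    return problems


crm = {"c1": {"name": "Ana"}}  # made-up CRM source
orders = [  # made-up order source
    {"order_id": 1, "customer_id": "c1", "amount": 20.0},
    {"order_id": 2, "customer_id": "c9", "amount": 15.0},
]
problems = validate_merge(crm, orders)
```

An empty list means the two sources line up under these rules; anything else should be resolved before analysis.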
Where would you start in big data management?
Start with the problem you’re trying to solve. That will dictate the kind of data that you need and what the architecture might look like.
How do you determine what performance requirements will be when setting up a big data management system?
Your needs will depend on the nature of the analysis you are supporting. You will need the right amount of computational power and speed. Some analysis will be real time, but you will be storing some amount of data as well. Questions to ask:
- How much data will my organization need to manage today and in the future?
- How often will my organization need to manage data in real time or near real time?
- How much risk can my organization afford? Is my industry subject to strict security, compliance and governance requirements?
Why do you need redundancy in your data management system?
So you are protected from unanticipated latency and downtime
What is in a big data tech stack?
What makes big data big?
It relies on picking up lots of data from lots of sources.
Why are APIs important in the big data stack?
To get massive amounts of data in, you need integration. Open application programming interfaces (APIs) will be core to any big data architecture. Interfaces exist at every level and between every layer of the stack. Without integration services, big data can’t happen.
Why does big data need different infrastructure than traditional data?
It must support an unanticipated or unpredictable volume of data, so it is based on a distributed computing model. This means that data may be physically stored in many different locations and can be linked together through networks, the use of a distributed file system, and various big data analytic tools and applications.
What is a distributed computing model?
A model in which data may be physically stored in many different locations and can be linked together through networks, the use of a distributed file system, and various big data analytic tools and applications.
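A minimal sketch of how blocks of data might be spread over several locations and still be found again. It assumes three hypothetical nodes and a simple hash-based placement rule with two copies per block; real distributed file systems use their own placement logic, and the block ID below is made up.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical storage nodes


def place_block(block_id: str, replicas: int = 2) -> list[str]:
    """Pick `replicas` distinct nodes for a block using a stable hash."""
    digest = int(hashlib.sha256(block_id.encode()).hexdigest(), 16)
    start = digest % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replicas)]


def read_block(block_id: str, down: set[str]) -> str:
    """Return the first live node that holds a copy of the block."""
    for node in place_block(block_id):
        if node not in down:
            return node
    raise RuntimeError("no live replica")


placement = place_block("sales-part-0")  # made-up block ID
# If the first node is down, the read falls through to the second copy.
fallback = read_block("sales-part-0", down={placement[0]})
```

Keeping more than one copy is also what lets the system tolerate a node failure.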
Why is redundant physical infrastructure important?
Because we’re dealing with so much data from so many different sources. Redundancy comes in many forms. If your company has created a private cloud, you will want to have redundancy built within the private environment so that it can scale out to support changing workloads. In some cases, this redundancy may come in the form of a Software as a Service (SaaS) offering that allows companies to do sophisticated data analysis as a service.
Why use SaaS for redundant physical infrastructure?
Lower costs, quicker startup and seamless evolution of the underlying tech
Why is security infrastructure important?
If you have to comply with regulations or keep customer info secure you will need to take into account who is allowed to see the data and under what circumstances they are allowed to do so.
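"Who is allowed to see the data" can be expressed as a policy that filters what each role gets back. The sketch below is only an illustration: the roles, field names, and sample record are invented, and a real deployment would rely on the access controls of its own data platform.

```python
POLICY = {
    # Hypothetical roles mapped to the fields they may read.
    "analyst": {"region", "order_total"},
    "support": {"region", "order_total", "email"},
}


def visible_fields(record: dict, role: str) -> dict:
    """Return only the fields the given role is allowed to see."""
    allowed = POLICY.get(role, set())  # unknown roles see nothing
    return {key: value for key, value in record.items() if key in allowed}


record = {"email": "ana@example.com", "region": "EU", "order_total": 55.0}
analyst_view = visible_fields(record, "analyst")
```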
What is an operational data source?
In big data you have to incorporate all the data sources that will give you a complete picture of your business and see how the data impacts the way you operate your business. In the past this was highly structured data managed in a relational database. But operational data now has to encompass a broader set of data sources, including unstructured sources such as customer and social media data.
What characteristics does a good operational data source have?
- Represents systems of record that keep track of the critical data required for real-time, day-to-day operation of the business
- Is continually updated based on transactions wherever they take place
- Blends structured and unstructured data
- Scales to support many users on a consistent basis