Lecture 28 Big Data Flashcards
Define big data
Big data is a generic term for datasets that are so large or complex that traditional data processing applications (e.g. a desktop computer, a small server or statistical tools normally used at small scale) are inadequate for mining them.
What are the 3 Vs of big data?
- High volume
- High velocity
- High variety
Which 3 more Vs have recently been added?
- High variability
- High veracity (variation in quality)
- High value (which comes with complexity)
Big data is a combination of what sorts of data?
- unstructured
- semi-structured
- structured
Why is data collected?
It is collected so that it can be mined and used to build predictive models and other advanced analytics applications.
What is structured data?
Data that can be catalogued
What is unstructured data?
Behavioural or ambiguous data
Is big data associated with any specific volume of data?
No. Big data deployments can involve terabytes (TB), petabytes (PB) and even exabytes (EB) of data, captured over time.
Why is big data important to companies?
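To give a sense of scale for these units, here is a minimal sketch that converts between them. It assumes the decimal (SI) definitions, where each step is a factor of 1000; the function names are made up for illustration.

```python
# Decimal (SI) storage units expressed in bytes (assumption: SI, not binary, prefixes).
UNITS = {"TB": 10**12, "PB": 10**15, "EB": 10**18}


def to_bytes(amount, unit):
    """Convert an amount given in TB, PB or EB to bytes."""
    return amount * UNITS[unit]


def terabytes_in(unit):
    """Return how many terabytes make up one of the given unit."""
    return UNITS[unit] // UNITS["TB"]


print(terabytes_in("PB"))  # 1000
print(terabytes_in("EB"))  # 1000000
```

So one exabyte is a million terabytes, which is why no single volume threshold defines big data.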
- use it to improve operations
- provide better customer service
- create personalized marketing campaigns
- faster decisions
- more-informed decisions
- they can become more customer-centric
Examples of big data?
- Business transaction systems
- Customer databases
- Medical records
- Internet clickstream logs
- Mobile applications
- Social networks
Examples of big data in SCIENCE?
- Scientific research repositories
- Machine-generated data
- Clinical records e.g. lifestyle, not just medical records
In what form is the data left?
The data may be left in its raw form on large servers, or preprocessed using data mining tools or data preparation software so that it can be analysed, e.g. Google/Amazon
How is Big Data used in Life Science?
- allows identification of risk factors in disease
- helps diagnose illnesses and conditions in individual patients
What is Big Data derived from?
Big data is derived from genomics, transcriptomics and epigenomics (OMICS) data of many individuals. It is also derived from electronic health records, social media, the web and other sources, which provide healthcare organisations and government agencies with up-to-the-minute information on infectious disease threats or outbreaks.
How is big data being used in the COVID-19 pandemic?
AI and big data are playing a key role in modelling, and in making predictions about the effect of the measures enforced as well as about the science of the virus itself.
What has poor prediction power on its own?
Genomics
What else is used to increase the predictive power?
Lifestyle and environmental predictions
Describe the abdominal aortic aneurysm example of Big Data
- Genome is taken alongside lifestyle and physiology
- Their genes, how these genes are activated, mutations and health records are used to build HEAL, a machine learning framework.
- This is carried out on many people to produce a prediction about the predisposition of individuals to the disease.
- Predisposition (genome) and lifestyle are balanced against each other before management of health is directed
- Genes and their specific pathways are identified
Eventually, the data is arranged to show the factors associated with risk. Red – genome, blue – eco, yellow – mixture. The closer the number is to 1, the better the model is at predicting. It is evident that the genome has a low predictive power compared to lifestyle in this case.
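The card does not name the score that runs up to 1. One common measure that behaves this way is the area under the ROC curve (AUC), so the sketch below assumes AUC purely for illustration. The labels and scores are made-up toy values, not data from the lecture or the study.

```python
def auc(labels, scores):
    """Chance that a random case (label 1) scores above a random control (label 0).

    Ties count as half. 0.5 means no better than guessing, 1.0 means perfect ranking.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up toy data: 1 = has the disease, 0 = does not.
labels = [1, 1, 1, 0, 0, 0]
genome_scores = [0.6, 0.4, 0.5, 0.5, 0.3, 0.6]     # hypothetical genome-only model
lifestyle_scores = [0.9, 0.5, 0.8, 0.4, 0.2, 0.6]  # hypothetical lifestyle model

print(round(auc(labels, genome_scores), 2))     # 0.56
print(round(auc(labels, lifestyle_scores), 2))  # 0.89
```

In this toy case the second model ranks cases above controls more often, so its score sits closer to 1.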
Describe the challenges of Big Data in life science
- Data analysis
- Data curation i.e. making it as high quality as possible
- Search engines i.e. making them powerful enough to search across the full set of data points
- Data sharing i.e. many different labs need to share data to build up big data
- Data storage and transfer
- Data visualisation
- Information privacy
Advantage of big data
High predictive power
Why does big data lead to more confident decision making?
High accuracy
In biology what has occurred as a result of high-throughput genomics?
Life scientists are starting to grapple with massive datasets, encountering big data challenges
Define data science
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data
Define machine learning
Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.
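As a rough illustration of "relying on patterns and inference" instead of explicit instructions, the sketch below uses a nearest-centroid classifier. This is one simple technique chosen for the example, not a method named in the lecture, and the feature values and labels are made up.

```python
def fit_centroids(samples, labels):
    """Learn one average point (centroid) per label from labelled examples."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Assign x to the label whose centroid is closest (squared Euclidean distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))


# Made-up training data: two numeric features per individual, with a risk label.
samples = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.2], [3.2, 2.8]]
labels = ["low", "low", "high", "high"]

model = fit_centroids(samples, labels)
print(predict(model, [1.1, 0.9]))  # low
print(predict(model, [2.9, 3.1]))  # high
```

No rule such as "if feature 1 is above 2 then high" is written anywhere; the boundary comes from the examples themselves.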