Chapter 1 Flashcards
(112 cards)
What has allowed for the development of data science?
Cheaper data storage, faster hardware and advances in algorithms
Define data science, data analytics and data mining
Data Science: The interdisciplinary field that uses scientific methods, algorithms, and systems to extract insights and knowledge from structured and unstructured data.
Data Analytics: The process of examining data sets to uncover patterns, trends, and insights, often with the aim of making informed business decisions.
Data Mining: The practice of discovering hidden patterns and relationships in large datasets using statistical and machine learning techniques.
What is data mining?
Data mining refers to extracting or “mining” knowledge from large amounts of data.
Includes sophisticated algorithms for analysing data that can’t be analysed manually.
It involves selecting, exploring and modelling large quantities of data to discover patterns or relations that are at first unknown. The aim is to obtain clear and useful results for the owner of the database (using previous data to make predictions).
What are other terms for data mining?
Knowledge discovery in databases (KDD)
Knowledge mining from databases
Knowledge discovery
Knowledge extraction
Data/pattern analysis
What are some examples of large datasets that may require data mining techniques?
- Supermarket information on transactions and customers
- NASA Earth Observation System
- Modern biology information on human genomes etc.
What is data?
What are some examples of data?
Any facts, numbers, text, images, audio, maps etc. that can be processed by a computer.
Data is information such as facts and numbers used to analyse something or make decisions.
Examples include:
- Transactions in supermarkets
- Stock market figures
- Words or sentences in a book
- Road layouts
- Satellite images
What is a defining characteristic of data mining compared to other methods?
Data mining is data driven, whereas other methods are often model driven
What is the difference in the size of the dataset used in statistics vs data mining?
In statistics, we want to find the smallest data size that gives sufficiently confident estimates.
In data mining, we want the data size to be large and we are interested in building a model that is small (not too complex) but still describes the data well.
What kind of techniques/disciplines are involved in data mining?
- Database technology
- Statistics
- Machine learning
- High-performance computing
- Pattern recognition
- Data visualisation
- Information retrieval
What do data mining, machine learning and deep learning have in common?
They have the same goal: to extract insights, patterns and relationships that can be used to make decisions.
But they have different approaches and abilities.
Describe data mining
Data mining can be considered a superset of many different methods to extract insights from data.
Data mining applies methods from many different areas to identify previously unknown patterns from data.
This can include:
- Statistical algorithms
- Machine learning
- Text analysis
- Time series analysis
- Other areas of analytics
DM also includes the study and practice of data storage and data manipulation.
Describe machine learning
Just like statistical models, the goal is to understand the structure of the data. ML has developed based on the ability to use computers to probe the data for structure, even if we do not have a theory of what the structure looks like.
The test for an ML model is its validation error on new data, not a theoretical test of a null hypothesis.
ML often uses an iterative approach to learn from data, so the learning can be easily automated. Passes are run through the data until a robust pattern is found.
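A minimal sketch of what "validation error on new data" can look like in practice. The synthetic data, the 70/30 split and the helper names (fit_line, mean_squared_error) are assumptions made for illustration, not part of the card: a model is fitted on one part of the data and then judged only on rows it has not seen.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and observed values."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


random.seed(0)

# Synthetic data (assumed for illustration): y = 2x + 1 plus random noise
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + random.gauss(0, 0.5) for x in xs]

# Shuffle the row indices and hold out 30% of the rows as "new" data
indices = list(range(len(xs)))
random.shuffle(indices)
split = int(0.7 * len(indices))
train_idx, valid_idx = indices[:split], indices[split:]

train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
valid_x = [xs[i] for i in valid_idx]
valid_y = [ys[i] for i in valid_idx]

# Fit on the training rows only, then measure error on the held-out rows
model = fit_line(train_x, train_y)
validation_error = mean_squared_error(model, valid_x, valid_y)
print("intercept, slope:", model)
print("validation MSE:", validation_error)
```

The number that matters here is the error on the held-out rows, since the model never saw them during fitting.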
Describe deep learning
Combines advances in computing power and special types of neural networks to learn complicated patterns in large amounts of data.
DL techniques are currently state of the art for identifying objects in images and words in sounds.
Future uses: automatic language translation and medical diagnoses
What were the steps in the development of data mining?
< 1960 - data collection and database creation (simplistic filing systems)
1970s - 1980s - database management systems (more hierarchical database systems were developed, along with databases consisting of named tables for organising data)
1980s - present - advanced database systems (developed further for data retrieval)
1980s - present - web-based database systems (with the internet boom)
1980s - present - data warehousing and data mining (to uncover previously unknown patterns in the data)
What does “data rich but information poor” allude to?
There has been a dramatic increase in the amount of stored data.
This far exceeds human ability for comprehension without powerful tools.
What does target marketing involve?
The supermarket finds patterns (clusters) of similar customers and targets these people with certain products. This is more cost effective than simply sending products to all customers.
Transaction data can model customer purchase patterns over time.
Customer similarity:
- Spending habits
- Income
- Interests
- Shopping patterns
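One way to find clusters of similar customers is k-means clustering. The sketch below is illustrative only: the customer figures, the two features (weekly spend, visits per month) and the choice of two clusters are assumptions, and in practice the features would usually be scaled first so that one of them does not dominate the distance.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers: (weekly spend, visits per month)
customers = [
    (20, 2), (120, 8), (25, 3), (130, 10),
    (30, 2), (110, 9), (22, 4), (125, 7),
]

# Use the first two customers as starting centroids (an arbitrary choice)
centroids, clusters = kmeans(customers, centroids=customers[:2])
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Each resulting cluster is a group of customers with similar habits, which the supermarket could then target with the same offers.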
What does cross-market analysis involve?
The supermarket finds patterns (associations) between products and markets them accordingly.
eg products that are often bought together
This is often referred to as market basket analysis.
We need to understand the data and the direction of the relationship (eg the beer and diapers example).
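The direction of an association can be made concrete with the standard support and confidence measures from market basket analysis. The baskets below are invented for illustration; the point is that confidence(beer -> diapers) and confidence(diapers -> beer) can differ.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent,
    the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical baskets (assumed data)
transactions = [
    {"beer", "diapers", "crisps"},
    {"diapers", "milk"},
    {"beer", "diapers"},
    {"diapers", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "milk"},
]

print("support(beer, diapers):", support(transactions, {"beer", "diapers"}))
print("confidence(beer -> diapers):",
      round(confidence(transactions, {"beer"}, {"diapers"}), 2))
print("confidence(diapers -> beer):",
      round(confidence(transactions, {"diapers"}, {"beer"}), 2))
```

In this toy data every beer buyer also buys diapers, but only some diaper buyers buy beer, so the rule is much stronger in one direction than in the other.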
What are the three steps of customer relationship management?
1 - acquiring new customers
2 - increasing the value of the customer
3 - retaining good customers
eg banks
What kinds of techniques may identify customers who would pose less risk?
Classification or scoring techniques
What technique may help the bank identify customer needs and offer linked products, increasing bank revenue?
Cross market analysis
By identifying customers that are most profitable, or who will be most profitable in the future, they aim to retain them.
Define financial analysis
The identification of financial trends and patterns over time (time series analysis) to ensure organisations can maximise profit
eg the stock market, business profit
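A simple moving average is one common way to expose a trend in a time series. This is a sketch only: the monthly profit figures and the 3-month window are assumptions for illustration.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical monthly profit figures (assumed data)
profits = [10, 12, 11, 13, 15, 14, 16, 18]

trend = moving_average(profits, window=3)
print(trend)
```

Smoothing removes the month-to-month wobble, so the underlying upward trend is easier to see.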
Define competition
The ability to monitor competitors and market directions. The ability to set price strategies in a highly competitive market.
Define forecasting
The identification of patterns, predicting future events by analysing past and present data and trends.
Eg weather forecasting identifies weather patterns.
Classification - rain or sun
Continuous - temperature etc.
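The two kinds of forecast target can be sketched with a nearest-neighbour predictor: a majority vote gives a class (rain or sun), and an average gives a continuous value (temperature). The observations, the features (humidity, pressure) and the choice of k = 3 are assumptions for illustration, not a real forecasting method.

```python
import math
from collections import Counter

# Hypothetical past observations:
# ((humidity %, pressure hPa), next-day weather, next-day temperature in °C)
history = [
    ((85, 1002), "rain", 12.0),
    ((90, 998), "rain", 11.0),
    ((80, 1005), "rain", 13.0),
    ((40, 1020), "sun", 22.0),
    ((35, 1025), "sun", 24.0),
    ((45, 1018), "sun", 21.0),
]


def nearest(history, query, k):
    """The k past observations whose features are closest to the query."""
    return sorted(history, key=lambda row: math.dist(row[0], query))[:k]


def predict_label(history, query, k=3):
    """Classification: majority weather label among the k nearest days."""
    votes = Counter(label for _, label, _ in nearest(history, query, k))
    return votes.most_common(1)[0][0]


def predict_temperature(history, query, k=3):
    """Continuous prediction: mean temperature of the k nearest days."""
    neighbours = nearest(history, query, k)
    return sum(temp for _, _, temp in neighbours) / len(neighbours)


today = (88, 1000)
print(predict_label(history, today))
print(predict_temperature(history, today))
```

The same past data feeds both predictions; only the type of output differs.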
Define fraud detection
A database may contain objects that do not comply with the general behaviour or model of the data. Data mining can involve identifying patterns and hence identifying any outliers (anomaly discovery)
One approach uses distance measures: objects that are a substantial distance from every cluster are considered outliers.
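A minimal sketch of distance-based outlier detection: a point is flagged when it is far from every cluster centre. The transactions, the cluster centres and the threshold of 50 are all assumed for illustration; in practice the centres would come from a clustering step and the threshold would need tuning.

```python
import math


def find_outliers(points, centroids, threshold):
    """Return the points whose distance to the nearest centroid
    exceeds the threshold."""
    outliers = []
    for p in points:
        nearest_distance = min(math.dist(p, c) for c in centroids)
        if nearest_distance > threshold:
            outliers.append(p)
    return outliers


# Hypothetical card transactions: (amount, hour of day)
transactions = [
    (20, 10), (22, 11), (19, 9),
    (200, 20), (205, 21), (198, 19),
    (950, 3),
]

# Assumed cluster centres for the two usual spending patterns
centroids = [(20, 10), (201, 20)]

print(find_outliers(transactions, centroids, threshold=50))
```

Only the large purchase at 3am lies far from both usual patterns, so it is the one flagged for review.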