Lectures 2 & 3 Flashcards
(16 cards)
2.1 What role do relational databases play in data wrangling?
Pros:
• They create a structure for data making it easier to analyse, query and store
• Make it easier to clean data, maintain consistency and security, especially with multiple users
• They allow you to query data via a high level language like SQL
Cons:
• Not all forms of data can be represented in a relational database.
• It may be hard to load some forms of data into a relational database, e.g. unstructured data like text, HTML, sequences, graphs etc.
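Example (a minimal sketch, not from the lectures): Python's built-in sqlite3 module is enough to show a structured table being queried through SQL. The table name, column names and rows are made up for illustration.

```python
import sqlite3

# In-memory database; "students" and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL, mark INTEGER)"
)
conn.executemany(
    "INSERT INTO students (name, mark) VALUES (?, ?)",
    [("Ada", 91), ("Grace", 78), ("Alan", 85)],
)

# A declarative query: state what is wanted, not how to search for it.
rows = conn.execute(
    "SELECT name, mark FROM students WHERE mark >= ? ORDER BY mark DESC",
    (80,),
).fetchall()
print(rows)
conn.close()
```

The NOT NULL constraint is a small instance of the database maintaining consistency: an insert without a name would be rejected.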
2.2 Regular Expressions. What do the following symbols mean? ., ^, $, *, +, |, []?
- ‘.’ matches any single character (by default, any character except a newline)
- ‘^’ matches start of string
- ‘$’ matches end of string
- ‘*’ zero or more repetitions of the preceding item
- ‘+’ one or more repetitions of the preceding item
- ‘|’ the “or” operator
- ‘[]’ a set of characters, e.g. [abcd] or [a-zA-Z]
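Example (a small sketch using Python's re module; the test strings are invented): each entry exercises one of the symbols above.

```python
import re

examples = {
    # '.' stands for any one character between 'c' and 't'
    "dot": re.findall(r"c.t", "cat cot c9t ct"),
    # '^' anchors the match at the start of the string
    "start": bool(re.search(r"^data", "data wrangling")),
    # '$' anchors the match at the end of the string
    "end": bool(re.search(r"data$", "data wrangling")),
    # '*' allows zero or more 'b' characters after 'a'
    "star": re.findall(r"ab*", "a ab abbb"),
    # '+' requires at least one 'b' after 'a'
    "plus": re.findall(r"ab+", "a ab abbb"),
    # '|' matches either alternative
    "or": re.findall(r"cat|dog", "cat dog cow"),
    # '[]' matches any one character from the set (here a, b or c)
    "set": re.findall(r"[a-c]+", "abc xyz cab"),
}

for name, result in examples.items():
    print(name, result)
```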
Regular Expressions: How would you express the following? • matches any character • matches start of string • matches end of string • zero or more repetitions • one or more repetitions • the “or” operator • a set of characters, e.g. [abcd] or [a-zA-Z]
- ‘.’ matches any single character (by default, any character except a newline)
- ‘^’ matches start of string
- ‘$’ matches end of string
- ‘*’ zero or more repetitions of the preceding item
- ‘+’ one or more repetitions of the preceding item
- ‘|’ the “or” operator
- ‘[]’ a set of characters, e.g. [abcd] or [a-zA-Z]
2.3 What is a CSV file?
CSV: Comma Separated Values – a file that stores tabular data in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.
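Example (a minimal sketch with made-up data): Python's csv module reads each line as a record and splits it into fields. Here io.StringIO stands in for an open file, and the quoted field shows how a comma inside a value is handled.

```python
import csv
import io

# Hypothetical CSV content: a header line followed by two records.
text = 'name,city,age\nAda,London,36\nGrace,"New York, NY",45\n'

# DictReader uses the first line as the field names.
records = list(csv.DictReader(io.StringIO(text)))

for record in records:
    print(record["name"], "|", record["city"], "|", record["age"])
```

Every field comes back as a string, because a CSV file carries no type information.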
2.3 What is a spreadsheet?
Spreadsheet: XLS files – data is stored in rows and columns like a grid and can be manipulated and used in calculations
2.3 What is the difference between a CSV file and a spreadsheet?
- CSVs lack the formatting information of spreadsheets.
- Spreadsheet files (e.g. XLS) are binary files that can only be opened by applications written specifically to read their format, whereas a CSV is plain text that any text editor can open.
- Spreadsheets can contain information other than raw data, e.g. graphs.
2.6 Why do we use XML?
XML purpose:
• You can define your own tags which makes it extensible.
• Separates style and content (HTML doesn’t)
• Rigorous adherence to rules (DTD)
• Both machines and humans can read it
2.6 What is an XML namespace?
Namespace declarations are used to qualify names with uniform resource identifiers (URIs). A qualified name consists of two parts
– prefix:local-name, where the prefix is bound to a namespace URI
They are used to give elements and attributes unique names in an XML document.
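Example (a sketch; the URI http://example.org/books and the bk prefix are invented): Python's xml.etree.ElementTree shows how a prefixed name is tied to its namespace URI once the document is parsed.

```python
import xml.etree.ElementTree as ET

# xmlns:bk binds the prefix "bk" to a (hypothetical) namespace URI.
doc = """<bk:book xmlns:bk="http://example.org/books">
  <bk:title>Data Wrangling</bk:title>
</bk:book>"""

root = ET.fromstring(doc)

# ElementTree stores the qualified name as {namespace-uri}local-name.
print(root.tag)

# When searching, a prefix-to-URI mapping is supplied by the caller.
title = root.find("bk:title", {"bk": "http://example.org/books"})
print(title.text)
```

The prefix itself is only shorthand; it is the URI that makes the name unique.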
2.7 What is the difference between HTML and XML?
- XML is easier for machines and humans to understand
- XML adds more meaning to the data than HTML does
- XML allows you to create your own tags, HTML doesn’t
- XML was developed to describe data and to focus on what the data represent. HTML was developed to display data and to focus on how the data look.
- XML is extensible, HTML isn’t
2.8 What is the difference between an XML attribute and an XML element? When would you use one and not the other?
- Use an attribute when you have a single, simple property to represent. Elements represent parts of objects. For example, an object may have two colours, so colour should be an element.
- Attributes are not easily expandable (for future changes)
- Attributes are difficult to manipulate by program code
- Attributes cannot contain other elements. Elements can contain elements.
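Example (a sketch with an invented car document): the single-valued id is stored as an attribute, while colour is an element because it can occur more than once.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one id (attribute), two colours (repeated element).
doc = """<car id="42">
  <colour>red</colour>
  <colour>black</colour>
</car>"""

car = ET.fromstring(doc)

# An attribute holds exactly one string value per name.
car_id = car.get("id")

# An element can repeat, and could later gain children of its own.
colours = [c.text for c in car.findall("colour")]

print(car_id, colours)
```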
2.11 What is the purpose of using XML namespaces and why are they useful?
- They allow multiple markup languages to be combined, without having to worry about conflicts of element and attribute names.
- Reusability: You can reuse a set of tags/attributes you define across different types of XML documents.
- Modularity: If you need to add some “aspect” to your XML, adding a namespace to your XML document is simpler than changing your whole XML schema definition.
- Avoid polluting the “main” namespace: You don’t force your parser to work with a huge schema definition, just use the namespace you need.
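Example (a sketch; both URIs and the order document are made up): two vocabularies that each define a table element are combined in one document, and the namespaces keep them apart.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces for a markup vocabulary and a furniture vocabulary.
ns = {
    "h": "http://example.org/html",
    "f": "http://example.org/furniture",
}

doc = """<order xmlns:h="http://example.org/html"
       xmlns:f="http://example.org/furniture">
  <h:table><h:tr><h:td>Invoice</h:td></h:tr></h:table>
  <f:table><f:material>oak</f:material></f:table>
</order>"""

root = ET.fromstring(doc)

# Same local name "table", but two different qualified names.
html_table = root.find("h:table", ns)
furniture_table = root.find("f:table", ns)
material = furniture_table.find("f:material", ns).text

print(html_table.tag)
print(furniture_table.tag)
print(material)
```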
2.12 What is the difference between XML and JSON? Where would you use one and not the other?
- JSON is simpler
- XML is a lot more verbose. JSON has less markup overhead (no closing tags)
- XML is a markup language whereas JSON is a data format. XML comes with a large family of standard companion languages for querying and transforming, e.g. XQuery
- JSON is not extensible. It is not a document markup language and so it isn’t necessary to define new tags or attributes to represent data in it
- JSON is widely used for web requests because it parses directly into native JavaScript objects. XML needs a separate parser and its tree must be traversed to get at the values.
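Example (a sketch with an invented person record): the same data is written once as JSON and once as XML. The JSON text loads straight into dictionaries and lists, while the XML tree has to be walked by hand.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in both formats.
json_text = '{"name": "Ada", "languages": ["Python", "SQL"]}'
xml_text = (
    "<person><name>Ada</name>"
    "<languages><language>Python</language><language>SQL</language></languages>"
    "</person>"
)

# JSON maps directly onto built-in data structures.
from_json = json.loads(json_text)

# XML is parsed into a tree; the values are pulled out explicitly.
root = ET.fromstring(xml_text)
from_xml = {
    "name": root.findtext("name"),
    "languages": [e.text for e in root.iter("language")],
}

print(from_json == from_xml)
print(len(json_text), len(xml_text))  # character counts show the extra verbosity
```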
2.14 What is the purpose of using schemas for XML and JSON data?
- We need to ensure the integrity of our data – define its expected structure and content.
- The format of the data can be specified by a schema and a document validated using schema checking software
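Example (a deliberately simplified sketch): real validation would use JSON Schema or a DTD/XML Schema together with a validator. The toy checker below is not one of those standards; it only illustrates the idea of declaring the expected structure once and checking documents against it. SCHEMA, validate and the records are all hypothetical.

```python
import json

# Toy schema (an assumption, not JSON Schema): field name -> required type.
SCHEMA = {"name": str, "age": int, "emails": list}


def validate(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


good = json.loads('{"name": "Ada", "age": 36, "emails": ["ada@example.org"]}')
bad = json.loads('{"name": "Ada", "age": "thirty-six"}')

print(validate(good, SCHEMA))
print(validate(bad, SCHEMA))
```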
2.15 What is the motivation behind linked data?
• Linked Data: Don’t just look at the data itself but also look at the things the data is connected to. It’s a method of publishing structured data so that it can be interlinked and become more useful through semantic queries.
2.15 What is the purpose of using JSON-LD or RDF to represent linked data?
JSON-LD: Provides mechanisms for specifying unambiguous meaning in JSON data
RDF (Resource Description Framework): Represents linked data as a graph of subject-predicate-object statements (triples). This graph can be serialised as XML (don’t worry about syntax!)
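Example (a hand-rolled sketch, not the full JSON-LD processing algorithm): the @context maps short JSON keys to URIs, which gives each key an unambiguous meaning and lets the document be read as RDF-style triples. The example.org identifiers are made up; the two schema.org property URIs are used purely as an illustration.

```python
import json

doc = json.loads("""
{
  "@context": {
    "name": "http://schema.org/name",
    "homepage": "http://schema.org/url"
  },
  "@id": "http://example.org/people/ada",
  "name": "Ada Lovelace",
  "homepage": "http://example.org/ada"
}
""")

context = doc["@context"]
subject = doc["@id"]

# Simplified expansion: replace each plain key by the URI given in @context.
triples = [
    (subject, context[key], value)
    for key, value in doc.items()
    if not key.startswith("@")
]

for triple in triples:
    print(triple)
```

A real application would use a JSON-LD library rather than this shortcut, which ignores nesting, value types and most of the specification.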
2.16 Why do we have different data formats and why do we wish to transform between different formats?
- Different data formats are used for different purposes. Some formats are more human-readable than others, which is why not everything is stored in binary.
- You may want to use data to achieve a different purpose and hence decide to transform it to a different format.
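Example (a minimal sketch with invented data): converting CSV records to JSON, for instance so that tabular data can be sent to a web application. The mark column is converted to an integer on the way, since CSV carries no types.

```python
import csv
import io
import json

# Hypothetical CSV input; io.StringIO stands in for a file on disk.
csv_text = "name,mark\nAda,91\nAlan,85\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV fields are all strings, so restore the numeric type explicitly.
for row in rows:
    row["mark"] = int(row["mark"])

json_text = json.dumps(rows, indent=2)
print(json_text)
```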