Lectures 2 & 3 Flashcards

1
Q

2.1 What role do relational databases play in data wrangling?

A

Pros:
• They create a structure for data making it easier to analyse, query and store
• Make it easier to clean data, maintain consistency and security, especially with multiple users
• They allow you to query data via a high level language like SQL
Cons:
• Not all forms of data can be represented naturally in a relational database.
• Some forms of data are hard to load into a relational database, e.g. unstructured data such as text, HTML, sequences and graphs.
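
As a rough illustration of the structure and high-level querying mentioned above, the sketch below uses Python's built-in sqlite3 module. The table name, column names and values are made up for the example.

```python
import sqlite3

# In-memory database; the table, columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT NOT NULL, temp_c REAL NOT NULL)")
conn.executemany(
    "INSERT INTO readings (city, temp_c) VALUES (?, ?)",
    [("Oslo", 4.0), ("Oslo", 6.0), ("Cairo", 30.0)],
)

# A declarative SQL query: average temperature per city, in alphabetical order.
rows = conn.execute(
    "SELECT city, AVG(temp_c) FROM readings GROUP BY city ORDER BY city"
).fetchall()
conn.close()
print(rows)
```

The schema enforces a fixed shape (two typed, non-null columns), which is what makes the data easy to query but also what makes free-form data awkward to load.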

2
Q

2.2 Regular Expressions. What do the following symbols mean? ., ^, $, *, +, |, []?

A
  • ‘.’ matches any single character (except a newline, by default)
  • ‘^’ matches start of string
  • ‘$’ matches end of string
  • ‘*’ zero or more repetitions
  • ‘+’ one or more repetitions
  • ‘|’ the “or” operator
  • ‘[]’ a set of characters, e.g. [abcd] or [a-zA-Z]
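
A small sketch of these symbols using Python's re module. The patterns and test strings are made up for illustration, and other regex engines may differ in details.

```python
import re

# Each entry exercises one symbol from the card; True means the pattern matched.
examples = {
    # '.' matches any single character
    "dot": re.fullmatch(r"c.t", "cat") is not None,
    # '^' anchors the match at the start of the string
    "caret": re.search(r"^data", "data wrangling") is not None,
    # '$' anchors the match at the end of the string
    "dollar": re.search(r"csv$", "report.csv") is not None,
    # '*' allows zero repetitions of 'b', so "a" matches
    "star": re.fullmatch(r"ab*", "a") is not None,
    # '+' needs at least one 'b', so "a" does not match
    "plus": re.fullmatch(r"ab+", "a") is not None,
    # '|' matches either alternative
    "or": re.fullmatch(r"cat|dog", "dog") is not None,
    # '[]' matches any one character from the set
    "set": re.fullmatch(r"[a-zA-Z]+", "Wrangle") is not None,
}
print(examples)
```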
3
Q
Regular Expressions: How would you express the following?
•	matches any character
•	matches start of string
•	matches end of string
•	zero or more repetitions
•	one or more repetitions
•	the “or” operator
•	a set of characters, e.g. [abcd] or [a-zA-Z]
A
  • ‘.’ matches any single character (except a newline, by default)
  • ‘^’ matches start of string
  • ‘$’ matches end of string
  • ‘*’ zero or more repetitions
  • ‘+’ one or more repetitions
  • ‘|’ the “or” operator
  • ‘[]’ a set of characters, e.g. [abcd] or [a-zA-Z]
4
Q

2.3 What is a CSV file?

A

CSV: Comma Separated Values – a file that stores tabular data in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.
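
A minimal sketch of reading such a file with Python's csv module. The sample data is made up, and treating the first line as a header row is an assumption, since CSV itself does not require one.

```python
import csv
import io

# A small hypothetical CSV document held in memory instead of on disk.
text = "name,age\nAda,36\nAlan,41\n"

# csv.reader yields one list of fields per line, split on commas.
records = list(csv.reader(io.StringIO(text)))
header, rows = records[0], records[1:]
print(header)
print(rows)
```

Every field comes back as a string; the file carries no type information, so any conversion to numbers or dates is up to the reader.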

5
Q

2.3 What is a spreadsheet?

A

Spreadsheet (e.g. an XLS file): data is stored in rows and columns, like a grid, and can be manipulated and used in calculations.

6
Q

2.3 What is the difference between a CSV file and a spreadsheet?

A
  • CSVs lack the formatting information of spreadsheets.
  • Spreadsheets are binary files that are read by applications written specifically for their format, whereas a CSV is plain text.
  • Spreadsheets can contain information other than raw data, e.g. graphs.
7
Q

2.6 Why do we use XML?

A

XML purpose:
• You can define your own tags, which makes it extensible.
• Separates style and content (HTML doesn’t)
• Rigorous adherence to rules, which can be checked against a DTD (Document Type Definition)
• Both machines and humans can read it

8
Q

2.6 What is an XML namespace?

A

Namespace declarations are used to qualify names with uniform resource identifiers (URIs). A qualified name consists of two parts, written namespace:local-name, where the namespace part is a prefix bound to a URI.
They are used to provide uniquely named elements and attributes in an XML document.
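
A short sketch of qualified names using Python's xml.etree.ElementTree. The document, the prefixes f and h, and the example.com URIs are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Two vocabularies both define a "table" element; the prefixes keep them apart.
doc = """
<order xmlns:f="http://example.com/furniture" xmlns:h="http://example.com/html">
  <f:table>Oak dining table</f:table>
  <h:table>rows and cells</h:table>
</order>
""".strip()

root = ET.fromstring(doc)

# Map prefixes to URIs for lookups; only the URIs need to match the document.
ns = {"f": "http://example.com/furniture", "h": "http://example.com/html"}
furniture = root.find("f:table", ns)
markup = root.find("h:table", ns)

# ElementTree stores the expanded name as {uri}local-name.
print(furniture.tag, furniture.text)
print(markup.tag, markup.text)
```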

9
Q

2.7 What is the difference between HTML and XML?

A
  • XML is easier for machines and humans to understand
  • XML adds more meaning to the data than HTML does
  • XML allows you to create your own tags, HTML doesn’t
  • XML was developed to describe data and to focus on what the data represents. HTML was developed to display data and to focus on how the data looks.
  • XML is extensible, HTML isn’t
10
Q

2.8 What is the difference between an XML attribute and an XML element? When would you use one and not the other?

A
  • Use attributes when you have a single property to represent. Elements represent parts of objects. For example, an object may have two colours, so colour should be an element.
  • Attributes are not easily expandable (for future changes)
  • Attributes are difficult to manipulate by program code
  • Attributes cannot contain other elements. Elements can contain elements.
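
The colour example above might look like this in practice. This is a sketch with a made-up car document: the single-valued id is an attribute, while the repeatable colour is an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one id (attribute), several colours (elements).
doc = '<car id="c42"><colour>red</colour><colour>black</colour></car>'
car = ET.fromstring(doc)

# An attribute holds exactly one string value per name.
car_id = car.get("id")

# Elements can repeat, and could later gain children of their own.
colours = [c.text for c in car.findall("colour")]
print(car_id, colours)
```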
11
Q

2.11 What is the purpose of using XML namespaces and why are they useful?

A
  • They allow multiple markup languages to be combined, without having to worry about conflicts of element and attribute names.
  • Reusability: You can reuse a set of tags/attributes you define across different types of XML documents.
  • Modularity: If you need to add some “aspect” to your XML, adding a namespace to your XML document is simpler than changing your whole XML schema definition.
  • Avoid polluting the “main” namespace: You don’t force your parser to work with a huge schema definition, just use the namespace you need to.
12
Q

2.12 What is the difference between XML and JSON? Where would you use one and not the other?

A
  • JSON is simpler
  • XML is a lot more verbose. JSON has less markup overhead (no closing tags, for example)
  • XML is a language whereas JSON is a data format. XML comes with a large family of standard languages for querying and transforming it, e.g. XQuery
  • JSON is not extensible. It is not a document markup language and so it isn’t necessary to define new tags or attributes to represent data in it
  • JSON is commonly used for web requests because it maps directly onto JavaScript objects. XML has to be parsed into a document tree and then navigated.
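
To make the comparison concrete, here is a sketch that encodes the same made-up record in both formats and reads each back with Python's standard library. The field names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in each format.
json_text = '{"name": "Ada", "languages": ["Python", "SQL"]}'
xml_text = (
    "<person><name>Ada</name>"
    "<language>Python</language><language>SQL</language></person>"
)

# JSON parses straight into native dicts and lists.
from_json = json.loads(json_text)

# XML parses into a tree that has to be navigated to extract the values.
root = ET.fromstring(xml_text)
from_xml = {
    "name": root.findtext("name"),
    "languages": [e.text for e in root.findall("language")],
}

print(from_json == from_xml)
print(len(json_text), len(xml_text))
```

Both routes recover the same data; the XML version is longer and needs an explicit mapping step.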
13
Q

2.14 What is the purpose of using schemas for XML and JSON data?

A
  • We need to ensure the integrity of our data – define its expected structure and content.
  • The format of the data can be specified by a schema and a document validated using schema checking software
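
The idea of checking a document against a declared structure can be sketched as follows. This is a hand-rolled stand-in, not a real schema language such as JSON Schema or XML Schema; the SCHEMA dictionary and the validate function are hypothetical names invented for the example.

```python
import json

# Hypothetical "schema": required field name -> expected Python type.
SCHEMA = {"name": str, "age": int}

def validate(document, schema):
    """Return a list of problems; an empty list means the document conforms."""
    problems = []
    for field, expected in schema.items():
        if field not in document:
            problems.append(f"missing field: {field}")
        elif not isinstance(document[field], expected):
            problems.append(f"wrong type for {field}")
    return problems

good = json.loads('{"name": "Ada", "age": 36}')
bad = json.loads('{"name": "Ada", "age": "thirty-six"}')
print(validate(good, SCHEMA))
print(validate(bad, SCHEMA))
```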
14
Q

2.15 What is the motivation behind linked data?

A

• Linked Data: Don’t just look at the data itself but also look at the things the data is connected to. It’s a method of publishing structured data so that it can be interlinked and become more useful through semantic queries.

15
Q

2.15 What is the purpose of using JSON-LD or RDF to represent linked data?

A

JSON-LD: Provides mechanisms for specifying unambiguous meaning in JSON data
RDF (Resource Description Framework): Represents data as a graph of subject-predicate-object statements. This graph can be serialised as XML (don’t worry about syntax!)
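
A rough sketch of how a JSON-LD context attaches unambiguous meaning to a plain JSON key. The document and the example.com identifier are made up, and the expansion shown is a naive lookup for illustration, not the actual JSON-LD processing algorithm.

```python
import json

# Hypothetical JSON-LD document: "@context" maps the short key "name" to a URI.
doc = json.loads("""
{
  "@context": {"name": "http://schema.org/name"},
  "@id": "http://example.com/people/ada",
  "name": "Ada Lovelace"
}
""")

context = doc["@context"]

# Naive expansion: turn each key covered by the context into a
# (subject, predicate, object) statement, the shape RDF uses.
triples = [
    (doc["@id"], context[key], value)
    for key, value in doc.items()
    if key in context
]
print(triples)
```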

16
Q

2.16 Why do we have different data formats and why do we wish to transform between different formats?

A
  • Different data formats are used for different purposes. Some formats are more human-readable than others, which is one reason not everything is stored in binary.
  • You may want to use data to achieve a different purpose and hence decide to transform it to a different format.
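
As one possible example of such a transformation, the sketch below converts a small made-up CSV table into JSON using only Python's standard library.

```python
import csv
import io
import json

# Hypothetical tabular input with a header row.
csv_text = "name,age\nAda,36\nAlan,41\n"

# DictReader uses the header row as keys, giving one mapping per record.
records = list(csv.DictReader(io.StringIO(csv_text)))

# Serialise the same records as a JSON array of objects.
json_text = json.dumps(records, indent=2)
print(json_text)
```

The values stay strings after conversion because CSV has no types; converting age to a number would be an extra, explicit step.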