Beyond relational data Flashcards

1
Q

What is semi-structured data?

A

Semi-structured data lies in between fully structured data (like relational databases) and entirely unstructured data (arbitrary data files)

2
Q

What is fully structured data?

A

Data that fits a strict schema. This makes highly efficient queries possible, but the data must have a very specific shape/structure

3
Q

What is unstructured data?

A
  • Can store basically anything: pictures, music, arbitrary text files
  • There is no precise description of the structure of this data, which means that programs working with these kinds of files need to know exactly how to extract the data
4
Q

What is semi-structured data?

A

Semi-structured data sits in between the two extremes (fully structured and unstructured data)
It tries to pick the best features of both: it has lots of flexibility but no fixed schema

5
Q

Describe a semistructured data model

A
  • You end up with a tree-like structure. These don’t need to be full trees, however; they can just be paths.
  • Like B+ trees, data is found in the leaves. However, it’s not as good for searching and doesn’t have strong balancing properties
  • Each edge has a label and the label defines the relationship between the two nodes
6
Q

Describe each of the elements that make up a semi-structured data tree like model

A
  • Leaf nodes: have associated data
  • Inner nodes: have edges going to other nodes
  • Root: has no incoming edges
7
Q

What advantage do semi-structured data models have over structured data models?

A
  • As it’s semi-structured, we can include some data for one node and leave it out for others
  • It’s not a requirement that each node has the same kinds of properties as every other node
  • We can very easily add attributes by just traversing to the correct place in the tree and adding a node
8
Q

What is semi-structured data useful for storing?

A
  • Often used for sharing things between companies over the internet
  • useful for storing documents
9
Q

What are some of the forms for storing semi-structured data?

A

XML, JSON, key-value stores, graphs

10
Q

Order these types of databases from fastest to slowest for accessing data: XML, JSON, Key-value, relational database

A
  • Key-value
  • relational database
  • XML, JSON
11
Q

What is the structure of an XML document

A
  • the first line says that it’s an XML file: XML version 1.0, encoding UTF-8 and standalone = yes, where standalone = yes means that we don’t have an external schema (DTD) for the file
  • inside we have a bunch of lecturers with tags around them
  • opening tags have no slash inside them, closing tags have a slash right after the <
  • an opening tag, its matching closing tag and everything in between form an element
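
A minimal sketch of such a document, parsed with Python’s standard xml.etree.ElementTree module. The element names (lecturers, lecturer, name, email) and the values are made up for illustration; they are not taken from the course material.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: declaration on the first line, then one root element.
xml_text = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<lecturers>
  <lecturer>
    <name>Ada</name>
    <email>ada@example.org</email>
  </lecturer>
  <lecturer>
    <name>Alan</name>
  </lecturer>
</lecturers>"""

# Pass bytes, because the declaration names an encoding.
root = ET.fromstring(xml_text.encode("utf-8"))

# Each <lecturer>...</lecturer> pair (tags plus content) is one element.
names = [lec.findtext("name") for lec in root.findall("lecturer")]
print(root.tag, names)
```

Note that the second lecturer has no email element, which is fine in semi-structured data.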
12
Q

What is not a problem in XML but is in file systems?

A
  • You can think of the tree as a file system, with each node as a folder. Children can, however, have the same name. This would be a problem in a normal file system, but it is fine in XML because when we query we search all paths that satisfy the condition
13
Q

What can XML trees not have?

A

In XML, we can’t have nodes with multiple parents, because XML files are always trees
- We can have references in trees though, which say that this node points to this other node. This is basically how shortcuts are done in a file system.

14
Q

What is the form for an XML element?

A
  • XML files are made up of a bunch of elements
  • an element is an opening tag, a matching closing tag and, in between, some content (arbitrary text or further elements)
15
Q

What do you do if you want to leave an element empty?

A

You just combine the opening and closing tags by writing <keyword/> (equivalent to <keyword></keyword>)
- element names are case-sensitive, so the names in the opening and closing tags must match exactly

16
Q

How are attributes defined in elements in XML documents?

A
  • Inside the opening tag, after the element name, write attribute name = "value", then another attribute name = "value" (if there’s more than one), then close the tag
  • each element can have only one attribute per name: you can have as many attributes as you like, but they must all be uniquely named
17
Q

When should something be an attribute and when should something just be another element

A
  • A staff ID can be either, because there’s only one ID per member of staff. Email addresses should not be attributes, because there could be multiple email addresses and you’re not allowed to write more than one value for an attribute
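
A small sketch of this rule using Python’s xml.etree.ElementTree. The staff element and its values are hypothetical. The single ID is an attribute, the repeated emails are child elements, and repeating an attribute name is rejected by the parser.

```python
import xml.etree.ElementTree as ET

# One ID per staff member: fine as an attribute.
# Several emails: use repeated child elements instead.
staff = ET.fromstring(
    '<staff id="s42">'
    '<email>a@example.org</email>'
    '<email>b@example.org</email>'
    '</staff>'
)
staff_id = staff.get("id")
emails = [e.text for e in staff.findall("email")]

# Writing the same attribute name twice is not well-formed XML.
try:
    ET.fromstring('<staff email="a@example.org" email="b@example.org"/>')
    duplicate_ok = True
except ET.ParseError:
    duplicate_ok = False

print(staff_id, emails, duplicate_ok)
```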
19
Q

What is document order?

A
  • Document order defines how the elements of an XML file are ordered: they’re simply ordered as they appear in the file. Whichever element comes first in the physical file is first
20
Q

What is a DTD?

A

A document type definition (DTD), like XML Schema, is used to define a schema for your XML files. It must be declared at the start of the document

21
Q

What are Entity references

A

Entity references are basically shortcuts: if you want to say that two elements are both members of the same group, you write the group once and point to it, instead of writing the group out in both of them.

22
Q

Why do we use Entity references

A

We do this because if the group were written out twice, someone reading the file could take it to mean two different groups instead of two places pointing to the same group

23
Q

What is CDATA used for?

A

For passing text on to the application that uses your XML file without the XML processor interpreting it as markup
- for example, if you want to use < or > inside your text, you need to tell the XML processor that this isn’t an error. This can be done with CDATA sections.

24
Q

What is a good way of defining format of an XML file

A
  • A DTD, which acts as a schema for an XML file
  • A DTD provides information about the structure of your XML documents, such as what elements may occur, what sub-elements may occur inside an element and what attributes each element has.
25
Q

What is the value of standalone in the first line when we have a DTD for the XML file?

A

We set it to “no” because we do have a schema

26
Q

What are the meanings of the symbols +, *, ?

A

+ means 1 or more of an item
* means 0 or more of an item
? means 0 or 1 of an item

27
Q

How do we define that an element has attributes in a DTD?

A
  • first you declare the element itself, without its attributes, by writing <!ELEMENT module EMPTY>
  • then you declare the attributes by writing <!ATTLIST module code CDATA #IMPLIED>
    and <!ATTLIST module title CDATA #IMPLIED>
  • #IMPLIED makes an attribute optional; to say the element must have it, write #REQUIRED instead
28
Q

What does #IMPLIED mean in DTD

A
  • #IMPLIED means that the attribute is optional
29
Q

How do we say that a specific attribute can’t be left empty in the DTD?

A
  • #REQUIRED if this attribute can’t be left empty, or some quoted value such as “COMPXXX” to give a default value, or #FIXED followed by a value if it’s a constant
30
Q

What are IDREF/IDREFS?

A
  • IDREF references one element
  • IDREFS references a list of elements
31
Q

What does ID allow you to define?

A

ID allows you to define a unique key to be associated with an element that you can use to point to this element later using IDREF
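
A rough sketch of how ID and IDREFS values can be followed by hand in Python. xml.etree.ElementTree does not read the DTD, so here it is simply assumed that code is declared as an ID and takes as IDREFS. The element and attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<university>'
    '<module code="m1" title="Databases"/>'
    '<module code="m2" title="Networks"/>'
    '<student name="Sam" takes="m1 m2"/>'
    '</university>'
)

# Index the modules by their (assumed) ID attribute.
by_id = {m.get("code"): m for m in doc.findall("module")}

# An IDREFS value is a whitespace-separated list of IDs.
student = doc.find("student")
titles = [by_id[ref].get("title") for ref in student.get("takes").split()]
print(titles)
```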

32
Q

What are the two levels of document processing?

A
  • Well formed and valid
33
Q

What does a non-validating processor ensure?

A

That an XML document is well formed before passing information on to an application

34
Q

What features does a well formed document have?

A
  • all elements must be within one root element
  • elements must be nested in a tree structure without any overlaps
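
As a quick illustration, Python’s xml.etree.ElementTree behaves like a non-validating processor: it checks well-formedness only. The helper name is_well_formed is made up for this sketch.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok = is_well_formed("<a><b>x</b><c/></a>")   # one root, properly nested
overlap = is_well_formed("<a><b>x</a></b>")  # tags overlap
two_roots = is_well_formed("<a/><b/>")       # more than one root element
print(ok, overlap, two_roots)
```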
35
Q

What is XPath?

A

XPath is basically an (advanced) “file” path in XML

36
Q

What does XPath do when we have multiple children with the same name?

A
  • You can return all of them
  • Or you can return the first, last or ith item, depending on what you want
37
Q

What does XPath allow us to do?

A
  • write queries that return a set of values or nodes from an XML document
  • values are strings, ints, reals, etc.
  • nodes are the entire document, an element node or an attribute
38
Q

What is the format for the most basic path in XPath?

A
  • the most basic path looks like this: /E1/E2/E3/…En
  • this is a slash, then the name of an element (tag), then a slash and so on until you reach the desired element
  • Whatever you reach by traversing down the path is what is returned.
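
A small sketch of a basic path using Python’s xml.etree.ElementTree, which supports only a limited subset of XPath. The document is hypothetical. ElementTree evaluates paths relative to an element, so the XPath /students/student/name is written as student/name starting from the root element students.

```python
import xml.etree.ElementTree as ET

students = ET.fromstring(
    '<students>'
    '<student><name>Zoe</name></student>'
    '<student><name>Adam</name></student>'
    '</students>'
)

# XPath /students/student/name, written relative to the root element.
names = [n.text for n in students.findall("student/name")]
print(names)
```

The matches come back in the order they appear in the document, not sorted.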
39
Q

If there is more than one result when traversing an XML document tree, what is the order of the results returned?

A
  • The results are returned in document order, so they are returned in the order they’re written in the XML file
40
Q

What is a relative path expression in XPath?

A
  • If you don’t start the expression with a /, it is evaluated relative to the current node, so it returns things below that node
  • so if we write student/name, it looks for student below the current node instead of starting from the root of the document
41
Q

How do we return attributes in XPath queries?

A
  • You write a path like before, but in the last step you write an @ and then the name of an attribute, and the query returns that attribute
42
Q

What does * mean in an XPath query

A

Use the * wildcard symbol to return anything directly below the named element
So /students/student/* will return every child element of each student, such as the program code and module code

43
Q

What is the general form of an XPath expression

A

/axis::E1/axis::E2/axis::E3/ then a tag name, attribute name or *

44
Q

What are common axes?

A
  • @ is short for attribute:: (returns an attribute)
  • child:: is the default, so you don’t have to write it
  • //E can replace descendant::E (strictly, // is short for /descendant-or-self::node()/)
  • .. is short for parent::node()
  • . is short for self::node()
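
A sketch of some of these abbreviations in Python’s xml.etree.ElementTree, which implements only part of XPath. In particular, it cannot return attributes with @ at the end of a path, so attribute values are read with get() instead. The document is hypothetical.

```python
import xml.etree.ElementTree as ET

students = ET.fromstring(
    '<students>'
    '<student id="1"><name>Zoe</name><module code="DB"/></student>'
    '<student id="2"><name>Adam</name></student>'
    '</students>'
)

# // : module elements at any depth below the starting node
codes = [m.get("code") for m in students.findall(".//module")]

# .. : the parent of every name element
parents = [s.get("id") for s in students.findall(".//name/..")]

# * : every child element of the first student (child axis is the default)
child_tags = [c.tag for c in students.find("student").findall("*")]

print(codes, parents, child_tags)
```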
45
Q

What is a proper descendant?

A

A proper descendant is everything you can reach by going one or more steps down from you (so it never includes yourself)

46
Q

What is a descendant-or-self?

A

A descendant-or-self is everything you can reach by going zero or more steps down from you (so it includes yourself)

47
Q

What is a preceding-sibling?

A
  • They have the same parent and come before in document order
48
Q

What is a following sibling

A
  • To be a following sibling, it’s still important you have the same parent! But you just come after the self node in document order.
49
Q

What is the general form of an XPath expression with conditions?

A

/axis::E1[C1]/axis::E2[C2]/axis::E3[C3]/ then a tag name, attribute name or *

50
Q

What type of conditions can we include in XPath query?

A

We can do comparisons and combine them using boolean operators such as and / or

51
Q

What does where category[1] return

A

It means that if a book belongs to multiple categories, the query takes the first category in document order and returns it

52
Q

What does last() return?

A

last() returns the last child node, which can be useful if we want the biggest item in a list whose items are stored in ascending order

53
Q

What does ancestor::*[2] return?

A

finds your parent’s parent

54
Q

What does E[XPath] return?

A

The condition is true if the XPath inside the brackets returns a non-empty set, so E is returned only in that case

55
Q

What does E[@price] return?

A

Returns E (for example a book) only if it has a price attribute
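
A sketch of these conditions using the XPath subset of Python’s xml.etree.ElementTree. The shop document is hypothetical. ElementTree supports [@attr], [@attr='value'], [tag], [position] and [last()], but not full XPath comparisons such as <.

```python
import xml.etree.ElementTree as ET

shop = ET.fromstring(
    '<shop>'
    '<book price="10"><title>A</title>'
    '<category>db</category><category>xml</category></book>'
    '<book><title>B</title><category>web</category></book>'
    '<book price="30"><title>C</title></book>'
    '</shop>'
)

# book[@price]: only books that have a price attribute
priced = [b.findtext("title") for b in shop.findall("book[@price]")]
# category[1]: the first category of each book, in document order
first_cat = [c.text for c in shop.findall("book/category[1]")]
# book[last()]: the last book
last_title = shop.find("book[last()]").findtext("title")
# book[category]: true when the inner path returns a non-empty set
with_cat = [b.findtext("title") for b in shop.findall("book[category]")]
# book[@price='30']: a comparison on an attribute value
exact = [b.findtext("title") for b in shop.findall("book[@price='30']")]

print(priced, first_cat, last_title, with_cat, exact)
```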

56
Q

What does E[string] return?

A

If you write a string inside the square brackets, the condition is true if the string is not empty. So if you write /data() in the condition, it checks that the string is not empty

57
Q

What is the role of XPath?

A

It’s not really used as a query language. It’s just used to define paths in your documents

58
Q

What is XQuery?

A
  • a language for finding and extracting elements and attributes from XML documents
59
Q

What is the simplest kind of XQuery?

A

XPaths

60
Q

What is the general format of an XQuery?

A

FLWR (for, let, where, return); usually written FLWOR when order by is included

61
Q

What does each part of FLWR do?

A

L - let clause: binds a variable, e.g. let this doc be the contents of ..
F - for clause: iterates over whatever elements you’re looking at; if there are several of them, it binds the variable to the first one, then to the second one, then to the third one and so on
W - where clause: a condition, e.g. where the E = …
R - return clause: says what we want to return each time we look at some element in our for loops, or just once if there’s no for loop, e.g. the element’s name
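
The four clauses map closely onto an ordinary loop. Below is a Python analogue (not XQuery itself) over a hypothetical students document; the XQuery clause each line stands for is given in the comments.

```python
import xml.etree.ElementTree as ET

# let $doc := (the document)
doc = ET.fromstring(
    '<students>'
    '<student><name>Zoe</name><year>2</year></student>'
    '<student><name>Adam</name><year>1</year></student>'
    '<student><name>Mia</name><year>2</year></student>'
    '</students>'
)

result = []
for s in doc.findall("student"):           # for $s in $doc/students/student
    if s.findtext("year") == "2":           # where $s/year = 2
        result.append(s.findtext("name"))   # return $s/name

print(result)
```

The return clause runs once per iteration that passes the where condition, which is why the result is a list.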

62
Q

What parts of FLWR in XQuery match with FROM, WHERE and SELECT in SQL?

A

Let and for = FROM
where = WHERE
Return = SELECT

63
Q

What does it mean that XQuery is case-sensitive?

A

Keywords must be written in lowercase letters, and names (variables, elements) must match exactly, including case

64
Q

Which parts of FLWR are mandatory?

A

F - optional
L - optional (but the query needs at least one for or let clause)
W - optional
R - mandatory

65
Q

What does ‘let $doc := XQuery expression’ do?

A
  • It lets you assign the result of the expression (for example your document) to the variable $doc
66
Q

In XQuery what do all variable names start with?

A

In XQuery, all variable names start with a dollar sign $

67
Q

Describe for clauses in XQuery?

A

A for clause executes whatever comes after it once for each item it iterates over. What follows could be more for clauses, or a let or where clause, and then finally a return clause.

68
Q

What does the XQuery for $s in $doc/students/student do?

A

It will iterate through every student in the document, binding $s to each one in turn

69
Q

Describe where clauses in XQuery?

A

You just have some condition inside the where clause. The condition is evaluated, and if it’s true the corresponding return clause is executed
- just write the keyword where followed by the condition

70
Q

What does it mean that XQuery has Existential semantics

A

Existential semantics in XQuery means that a condition on a set of elements or attributes is true if at least one item in the set satisfies it. It is about the existence of a matching item: what the condition gives back is true or false rather than the data itself.

71
Q

What happens if we compare two expressions?

A

This still uses existential semantics. For $s1/name = $s2/name, if some name found on the left equals some name found on the right, the comparison is true. If one of the two sides, or both, cannot supply such a name, it is false and nothing is returned.
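
A Python analogue of this comparison (not XQuery itself): the = between two sequences is true as soon as one pair of items matches. The function name general_equals and the sample names are made up.

```python
def general_equals(left, right):
    """True if at least one item on the left equals at least one on the right."""
    return any(a == b for a in left for b in right)

s1_names = ["Sam", "Samuel"]  # $s1/name returns two names
s2_names = ["Samuel"]         # $s2/name returns one name
s3_names = []                 # a student with no name element

match = general_equals(s1_names, s2_names)
no_match = general_equals(s1_names, s3_names)
print(match, no_match)
```

An empty side can never produce a matching pair, so the comparison is false in that case.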

72
Q

What does the some keyword return

A
  • We can use some to say: return true if at least one $variable in the XQuery expression satisfies the condition
73
Q

What does the every keyword return

A
  • We can use every to say: return true only if every $variable in the XQuery expression satisfies the condition
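
In Python terms, some and every behave like the built-in any() and all(). The sketch below is an analogue with made-up grades, with the XQuery form in the comments.

```python
grades = {"Zoe": [72, 65, 80], "Adam": [55, 38], "Mia": []}

# some $g in $s/grade satisfies $g < 40
some_fail = {name: any(g < 40 for g in gs) for name, gs in grades.items()}

# every $g in $s/grade satisfies $g >= 40
every_pass = {name: all(g >= 40 for g in gs) for name, gs in grades.items()}

print(some_fail)
print(every_pass)
```

Over an empty sequence, some is false and every is true, just as any() and all() are for an empty list.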
74
Q

How can we get an XQuery to output multiple items?

A

We can fix this by wrapping the items in an element such as <pair>…</pair>, as this now returns only one item (similar to returning a tuple in Java)
- and by putting curly brackets around each item that you want to return and a comma between them

75
Q

What can we use nested loops with the <pair> element to do?

A

We want to return each title and its authors separately if there is more than one author, so we use nested loops.

We loop over the books, then loop over the authors of each book, and then output the pair of the title and the author’s name.

76
Q

How does the Order by clause fit into FLWR

A

We can also use order by like in SQL: we insert the order by clause just above the return clause
- then just add the keyword ascending or descending

77
Q

How do the group by clause and a second where clause fit into FLWR?

A

We put group by underneath the original where clause
- Just like in SQL, we can also have a where clause after group by, known as HAVING in SQL (it does not affect the grouping; it filters the results after group by was performed on them)

78
Q

How do we get distinct values in XQuery?

A

In XQuery, it’s a function, called distinct-values. Before it outputs the distinct values, it also converts elements to strings
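
A Python analogue of what distinct-values does, on a hypothetical document: first turn the elements into strings, then drop duplicates. This sketch keeps the order of first appearance; as far as I know, XQuery itself does not promise any particular order.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<students>'
    '<student><module>DB</module><module>AI</module></student>'
    '<student><module>DB</module></student>'
    '</students>'
)

# Convert the elements to strings first ...
values = [m.text for m in doc.findall("student/module")]
# ... then remove duplicates (dict keys are unique and keep insertion order).
distinct = list(dict.fromkeys(values))

print(values, distinct)
```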

79
Q

Where do we put the distinct-values function if we want all the returned values to be distinct, not just distinct within each path (such as within each student)?

A
  • We wrap distinct-values around the whole query, and this gives us the desired output.
  • So we write it at the very start, like distinct-values(let $doc:= “mydoc.xml”…)
80
Q

What are the 3 general approaches to storing XML documents in a relational database

A
  • Store XML documents as entries of a table
  • Store XML documents in schema-independent form
  • Store XML documents in shredded form across a number of attributes and relations
81
Q

How do we store XML in an attribute

A

We create a table XMLStaff with three attributes: doc number, doc date and some staff data.
- Then we can insert values for the doc number and doc date into XMLStaff. To insert the XML data, you write XML(‘’) with the XML data inside the quotation marks, like a string.

82
Q

How do we store XML in a Schema-Independent representation?

A
  • Store the xml file as a tree inside our database.
  • Since XML is a tree structure, each node may have only one parent.
  • The rootID attribute allows a query on a particular node to be linked back to its document node.
  • While this doesn’t depend on a schema, the recursive nature of structure can cause performance problems when searching for specific paths.
  • To overcome this, we create a denormalised index (table) containing combinations of path expressions and a link to node and parent node.
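
A rough sketch of this idea with Python’s sqlite3 and xml.etree.ElementTree. The table and column names are assumptions, and for brevity the path is kept as an extra column on the node table rather than in a separate denormalised index table.

```python
import sqlite3
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<staff><member><name>Ann</name></member>"
    "<member><name>Bob</name></member></staff>"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (node_id INTEGER PRIMARY KEY, parent_id INTEGER, "
    "root_id INTEGER, tag TEXT, content TEXT, path TEXT)"
)

def store(elem, parent_id, root_id, parent_path, counter):
    """Insert one row per element; each row records its single parent."""
    node_id = counter[0]
    counter[0] += 1
    if root_id is None:
        root_id = node_id  # the document node points at itself
    path = f"{parent_path}/{elem.tag}"
    text = (elem.text or "").strip() or None
    conn.execute(
        "INSERT INTO node VALUES (?, ?, ?, ?, ?, ?)",
        (node_id, parent_id, root_id, elem.tag, text, path),
    )
    for child in elem:
        store(child, node_id, root_id, path, counter)

store(doc, None, None, "", [1])

# The stored path avoids walking parent links recursively at query time.
names = [row[0] for row in conn.execute(
    "SELECT content FROM node WHERE path = '/staff/member/name' ORDER BY node_id"
)]
print(names)
```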
83
Q

How do we store XML in Shredded Form?

A
  • extract all the data from your XML document and then put it into your database by spreading it over a number of attributes in one or more relations
  • may make it easier to index values of some elements, provided these elements are placed into their own attributes.
  • Also possible to add some additional data relating to hierarchical nature of the XML, making it possible to recompose original structure and ordering, and to allow the XML to be updated.
  • however with this approach you also have to create an appropriate database structure.
84
Q

What is a NoSQL database?

A

It doesn’t mean there’s no SQL involved. It just means that the database is not entirely SQL or relational (NoSQL is often read as “not only SQL”)

85
Q

What are noSQL databases used for?

A

Applications that just collect all the data together for each user
- whereas data that must be kept safe needs to be stored in reliable relational databases

86
Q

What are some NoSQL Database Characteristics

A

Designed to guarantee:
- Availability: every non-failing node should always accept new queries (it shouldn’t lock)
- Consistency, but not the same version of consistency as in ACID
- Scalability: add capacity by just adding new computers to the network
- High performance: achieved by only doing simple operations, such as look-ups based on keys and insertions of keys with their corresponding values. Such simple operations can be done even faster than in relational databases

87
Q

What is Partition tolerance

A
  • Partition tolerance means that if a connection between nodes fails or a bunch of nodes fail, the remaining subnetworks can still operate despite missing information from another/multiple nodes
88
Q

What is the CAP Theorem?

A
  • Of consistency, availability and partition tolerance, we cannot achieve all three simultaneously
  • You can only have two of these properties at once
89
Q

What do NoSQL databases guarantee (Not ACID)

A

BASE

90
Q

What does BASE mean?

A
  • Basically available states that you should be able to answer queries nearly all the time as long as computers are running.
  • Soft state and eventually consistent mean that a database state might occasionally be inconsistent. So the result you get in one place may be different to the result you get from another place because they might not have all information available at all nodes but eventually the information and the system will be made consistent.
91
Q

What are the common NoSQL databases?

A
  • Key-value stores: only let you store key-value pairs, so you can find keys very fast. No operations are possible apart from inserting and searching for keys.
  • Document stores: do the same as a key-value store, except that the value is typically some semi-structured data, and you can also look up data that is stored inside the value next to a key
  • Column stores
  • Graph databases

92
Q

What are key-value stores?

A
  • the simplest kind of database system
  • a collection of tables and each table consists of key value pairs
  • In essence it’s just one big index
93
Q

Describe distributed storage with key-value pairs

A
  • you store each key-value pair at some node (a database in your distributed storage), so the pairs are spread over the different computers
  • you use a hash function (a function that always gives the same output for the same input) to map each key to an integer between 0 and 2^n - 1, where n is some sufficiently large number, so you can space out your nodes
  • after determining a hash function, you distribute these integers among the nodes (computers): each node is given positions in this range, which is treated as a cycle
  • you can have multiple copies of the same node in many places on the cycle
  • so in the diagram, we have 3 copies of A, 3 copies of B and only 2 copies of C, which could mean computer C is less powerful than A and B
  • a key-value pair whose key hashes to integer i is then stored at the node that comes next after i on the cycle
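
The scheme above can be sketched in Python roughly as follows. The class name Ring, the use of SHA-256, the 32-bit ring size and the way node positions are derived are all assumptions made for this sketch, not details from the course.

```python
import bisect
import hashlib

RING_BITS = 32  # n: positions run from 0 to 2**n - 1

def position(label: str) -> int:
    """Hash a key or node label to an integer on the ring."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (2 ** RING_BITS)

class Ring:
    def __init__(self, copies):
        # copies maps a node name to how many positions it gets on the ring,
        # e.g. fewer positions for a less powerful computer.
        self.points = sorted(
            (position(f"{node}#{i}"), node)
            for node, count in copies.items()
            for i in range(count)
        )
        self.keys = [p for p, _ in self.points]

    def node_for(self, key: str) -> str:
        # First node position at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self.keys, position(key)) % len(self.points)
        return self.points[idx][1]

ring = Ring({"A": 3, "B": 3, "C": 2})
owner = ring.node_for("user:17")
print(owner)
```

Because the hash function is deterministic, every computer that knows the ring can work out where a key lives without asking a central index.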
94
Q

How can we add scalability to distributed storage of key value pairs

A
  • Scalability is fairly simple: if you need more power, just add a new computer (node) at any location in the cycle and move the key-value pairs around appropriately to balance out the number of pairs at each node
95
Q

How do we ensure availability using replication on distributed storage of key value pairs?

A
  • We just store replicas (backups) on consecutive nodes (next to each other) in clockwise order. If you replicate something twice, it will be stored in the following node and in the node after that. So if the following node fails, we can get the data from the one after.
96
Q

What is versioning?

A
  • allows multiple versions of a data item
  • if the newest version of a data item has not yet been stored on all of the replicas, you get the most recent version that is available.
  • This is typically fine for the majority of applications, but if it’s not, we assign a vector clock to each version of an item X, which is a list of nodes and timestamps.
  • The node is where the version is stored, and the timestamp is the local time on that node when the item was written.
  • We use the vector clocks to decide whether one version originated from another. This is done by checking that every timestamp in one vector clock is smaller than or equal to the corresponding timestamp in the other vector clock
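
A minimal sketch of the vector clock comparison in Python. It assumes each clock is a dict from node name to a counter (a simplification of the local timestamps described above), and the function names are made up.

```python
def descends(newer: dict, older: dict) -> bool:
    """True if `newer` has seen everything recorded in `older`."""
    return all(newer.get(node, 0) >= stamp for node, stamp in older.items())

def conflict(a: dict, b: dict) -> bool:
    """Neither version originated from the other."""
    return not descends(a, b) and not descends(b, a)

v1 = {"A": 1}           # written once at node A
v2 = {"A": 2}           # v1 updated again at node A
v3 = {"A": 1, "B": 1}   # v1 updated at node B instead

print(descends(v2, v1), descends(v1, v2), conflict(v2, v3))
```

Here v2 and v3 both come from v1, but neither comes from the other, so they are concurrent versions that have to be reconciled.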
97
Q

Describe document stores

A
  • Databases that store collections of “documents”
  • the object ID is typically generated in some way. Could just be incremental.
  • the document value part is typically written in JSON (think of it as XML though)
98
Q

What is the main difference between XML and JSON?

A

JSON is shorter

99
Q

When are document stores useful to use?

A
  • When a user asks about a restaurant they want information about that restaurant but not about other restaurants
  • Makes sense to use it in this case because the amount and what type of information you want varies from restaurant to restaurant
100
Q

How do you update document stores?

A

If you want to update it you do it only on the master
Whenever you want to do a read, you by default ask the master

101
Q

How do you do replication of document store NoSQL databases?

A

You first split your collection into horizontal fragments; then, based on the shard key (an indexed field that exists in all documents), you distribute them over different nodes
For replication you copy entire fragments to other nodes using a master/slave approach, where you have one primary copy on the master and a bunch of secondary copies (the replicas)
You only have to instruct the master to update, and this node will instruct the replicas to update

102
Q

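What is the shard (or shared) key?

A
  • in a document store, the shard key is an indexed field that exists in all documents and decides which horizontal fragment a document is placed in
  • a shared key, by contrast, is used in vertical fragmentation to be able to natural join all of the fragments back together: it’s the attribute that all the fragments have to include with their data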
103
Q

What are column store NoSQL databases?

A
  • You have multiple levels of columns; from an abstract point of view it looks similar to an SQL table
  • We have column qualifiers, which are the names of the columns
  • It differs from typical tables in that columns are grouped into column families, and you don’t need to specify all of the column qualifiers in a column family: some can be left empty
    - Different rows can have different qualifiers
  • Cells just hold values
  • To reference an item you specify the column family name and then the column qualifier
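
One way to picture this layout is as nested dictionaries: row key, then column family, then column qualifier, then the cell value. The sketch below is a plain Python model with made-up names; it is not the API of any real column store.

```python
# row key -> column family -> column qualifier -> cell value
table = {
    "row1": {
        "personal": {"name": "Ann", "city": "Leeds"},
        "work": {"role": "Lecturer"},
    },
    "row2": {
        "personal": {"name": "Bob"},  # no city, and no work family at all
    },
}

def get_cell(row_key, family, qualifier):
    """Look up one cell; returns None when the row leaves it empty."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)

def scan(family, qualifier):
    """Go through the whole table and collect one column where present."""
    return {
        key: row[family][qualifier]
        for key, row in table.items()
        if qualifier in row.get(family, {})
    }

print(get_cell("row1", "personal", "city"))
print(scan("personal", "name"))
```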
104
Q

What is the scan keyword used for?

A

Scan is used for going through the whole document (table)

105
Q

What is the HBase technique?

A
  • uses two levels of fragmentation: a top and a bottom level
  • top level: horizontal fragmentation, the rows of a table are split into regions
  • bottom level: vertical fragmentation, regions store different column families (subsets of the columns) in different places, possibly on different nodes
106
Q

Describe graph databases

A
  • Store the data as a graph, it’s different to XML and JSON trees
  • there are no requirements: such as a root node, they can all link to each other, this graph doesn’t have to be a nice tree
  • Graphs are slower than other tools