Beyond relational data Flashcards

Question

What is the value of standalone in the first line when we have a DTD for the XML file?

Answer 1

We set it to “no” because the document relies on an external DTD (we do have a schema), so it is not standalone

Answer 2

+ means 1 or more of an item; * means 0 or more of an item; ? means 0 or 1 of an item

Answer 3

- first you define the element without the attribute, by writing an <!ELEMENT ...> declaration
- then you define its attributes, by writing an <!ATTLIST ...> declaration

Answer 4

- #IMPLIED means that the data is optional

Answer 5

- #REQUIRED if this attribute can’t be left empty
- or some value such as “COMPXXX” as a default value
- or #FIXED and some value if it’s a constant

Answer 6

- IDREF references one element - IDREFS references a list of elements

Answer 7

ID allows you to define a unique key to be associated with an element that you can use to point to this element later using IDREF

Answer 8

- Well formed and valid

Answer 9

That an XML document is well formed before passing information on to an application

Answer 10

- all elements must be within one root element - elements must be nested in a tree structure without any overlaps
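A minimal sketch of these two rules, assuming Python's standard xml.etree.ElementTree parser, which checks well-formedness but does not validate against a DTD. The helper name is_well_formed and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML (well-formedness only, no DTD validation)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Properly nested elements inside a single root element.
nested = "<students><student><name>Ann</name></student></students>"
# <name> and <student> overlap instead of nesting.
overlapping = "<students><student><name>Ann</student></name></students>"
# Two top-level elements, so there is no single root.
two_roots = "<student/><student/>"

results = [is_well_formed(doc) for doc in (nested, overlapping, two_roots)]
```

Only the first document should pass; the other two break the nesting rule and the single-root rule respectively.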

Answer 11

XPath is basically an (advanced) “file” path in XML

Answer 12

- You can return all of them - Or you can return, the first, last or ith item depending on what you want

Answer 13

- write queries that return a set of values or nodes from an XML document
- values are strings, ints, reals, etc.
- nodes are the entire document, an element node or an attribute

Answer 14

- the most basic path looks like this: /E1/E2/E3/…/En
- this is a slash, then the name of an element, then a slash, and so on until you reach the desired element
- whatever you reach by traversing down the path is what is returned

Answer 15

- The result is returned in document order, so it will return them in the order they’re written in the XML file

Answer 16

- If you don’t start the expression with a / then it will be evaluated relative to the current (context) node, so it will only return things below that node
- so if we put student/name it will start at student as opposed to the root node students
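A small sketch of path evaluation, assuming Python's xml.etree.ElementTree, which only supports a limited subset of XPath and always evaluates paths relative to the element you call it on. The sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<students>
  <student id="s1"><name>Ann</name><programme>G400</programme></student>
  <student id="s2"><name>Bob</name><programme>G500</programme></student>
</students>
"""
root = ET.fromstring(doc)  # root is the <students> element

# Path evaluated relative to the root <students>: each step names an element.
# Results come back in document order.
names = [e.text for e in root.findall("student/name")]

# The same kind of path evaluated relative to a single <student> node
# only sees what is below that node.
first_student = root.find("student")
first_student_names = [e.text for e in first_student.findall("name")]
```

Here names follows the order in the file, and the second query only sees the name below the first student.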

Answer 17

- You write a path like before, but in the last step you write an @ and then the name of an attribute, and it will output whatever attribute you wrote in there

Answer 18

Use the * wildcard symbol to return anything directly below the named element. So /students/student/* will return the program code, module code
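A sketch of attribute access and the wildcard, assuming Python's xml.etree.ElementTree. Its path language cannot return attributes directly, so the attribute is read with get() on the selected elements. The element and attribute names are invented for illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<students>
  <student id="s1"><programme>G400</programme><module>COMP101</module></student>
  <student id="s2"><programme>G500</programme><module>COMP202</module></student>
</students>
"""
root = ET.fromstring(doc)

# In full XPath this would be /students/student/@id. ElementTree paths cannot
# return attributes, so we select the elements and read the attribute with get().
student_ids = [s.get("id") for s in root.findall("student")]

# In full XPath this would be /students/student/*:
# every element directly below a student, whatever its name.
child_values = [c.text for c in root.findall("student/*")]
```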

Answer 19

/axis::E1/axis::E2/axis::E3/ then a tag name, attribute name or *

Answer 20

- @ is short for attribute:: (returns an attribute)
- child:: is the default axis, so you don't have to write it
- //E is short for /descendant-or-self::node()/E, so it finds E at any depth
- .. is short for parent::node()
- . is short for self::node()
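A sketch of the abbreviations, assuming Python's xml.etree.ElementTree, which accepts `.`, `..` and `//` in its limited XPath subset. The document is invented for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<students>"
    "<student><name>Ann</name></student>"
    "<student><name>Bob</name></student>"
    "</students>"
)

# .//name : name elements at any depth below the current node.
all_names = [e.text for e in root.findall(".//name")]

# .. : step up to the parent of each selected name element.
parents_of_names = [e.tag for e in root.findall(".//name/..")]

# . : the current node itself.
self_nodes = root.findall(".")
```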

Answer 21

A descendant is everything you can reach by going one or more steps down from you (your children, their children, and so on)

Answer 22

A descendant-or-self is everything you can reach by going zero or more steps down from you, so it also includes the node itself
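A sketch contrasting children, descendants and descendant-or-self, assuming Python's xml.etree.ElementTree. Iterating an element gives its children, and iter() walks the element itself plus everything below it in document order. The tiny tree is invented for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b><c/></b><d/></a>")

# child axis: exactly one step down.
children = [e.tag for e in root]

# descendant-or-self axis: zero or more steps down (includes the node itself).
descendant_or_self = [e.tag for e in root.iter()]

# descendant axis: one or more steps down (drop the node itself).
descendants = descendant_or_self[1:]
```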

Answer 23

- They have the same parent and come before in document order

Answer 24

- To be a following sibling, it’s still important you have the same parent! But you just come after the self node in document order.

Answer 26

/axis::E1[C1]/axis::E2[C2]/axis::E3[C3]/ then a tag name, attribute name or *, where each Ci is a condition on that step

Answer 27

We can do comparisons and do combinations of these using binary operators

Answer 28

means that if a book belongs to multiple categories it will take the first category in the document order and return it

Answer 29

last() refers to the last node in the current node set, so a condition like [last()] selects the last child node. This can be useful if we want the biggest item in a list, provided the data is stored in ascending order

Answer 30

finds your parent’s parent

Answer 31

Returns true if the xpath returns a non-empty set

Answer 32

returns a book if it has a price attribute
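A sketch of positional and attribute conditions, assuming Python's xml.etree.ElementTree, whose XPath subset supports [1], [last()] and [@attr]. The books document is invented for illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<books>
  <book price="10"><title>A</title><category>db</category><category>xml</category></book>
  <book><title>B</title><category>web</category></book>
</books>
"""
root = ET.fromstring(doc)

# [1]: the first category of each book, in document order.
first_categories = [c.text for c in root.findall("book/category[1]")]

# [last()]: the last category of each book.
last_categories = [c.text for c in root.findall("book/category[last()]")]

# [@price]: only books that have a price attribute.
priced_titles = [b.findtext("title") for b in root.findall("book[@price]")]
```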

Answer 33

If you write a string inside one of the square brackets, it returns true if the string is not empty. So if you write /data() in the condition, it checks that the string is not empty

Answer 34

It’s not really used as a query language. It’s just used to define paths in your documents

Answer 35

- a language for finding and extracting elements and attributes from XML documents

Answer 36

- L - let clause: let this doc be the contents of ..
- F - for clause: iterates over whatever elements you’re looking at; if there are several of them, it binds the variable to the first one, then to the second one, then to the third one and so on
- W - where clause: so here we’re saying, where the E = ...
- R - return clause: here we say what we want to return each time we look at some element in our for loops, or just once if there’s no for loop. In this case we want to return our element name.
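A rough Python analogue of the four clauses (not XQuery itself), assuming xml.etree.ElementTree and an invented students document. The list comprehension plays the role of for, where and return, and the assignment plays the role of let.

```python
import xml.etree.ElementTree as ET

doc = """
<students>
  <student><name>Ann</name><programme>G400</programme></student>
  <student><name>Bob</name><programme>G500</programme></student>
  <student><name>Cat</name><programme>G400</programme></student>
</students>
"""

root = ET.fromstring(doc)                    # L: bind the document to a variable
g400_names = [
    s.findtext("name")                       # R: what to return for each binding
    for s in root.findall("student")         # F: iterate over the student elements
    if s.findtext("programme") == "G400"     # W: keep only bindings passing the test
]
```

The return part is evaluated once per student that passes the where test, which mirrors how the clauses are described above.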

Answer 37

- let and for = FROM
- where = WHERE
- return = SELECT

Answer 38

XQuery is case-sensitive, so you must write all keywords in lowercase letters

Answer 39

- F - optional
- L - optional
- W - optional
- R - mandatory

Answer 40

- It lets you assign your document to a variable

Answer 41

In XQuery, all variable names start with a dollar sign $

Answer 42

It just executes whatever comes after the for clause. This could be more for clauses, or a let or where and then finally a return clause.

Answer 43

It will iterate through every student in the database

Answer 44

You just write the keyword where followed by some condition. The condition is evaluated, and if it’s true then the corresponding return clause is executed

Answer 45

Existential semantics in XQuery means that a comparison over sets of elements or attributes is true as soon as at least one matching item exists. The condition checks for the existence of a match rather than comparing one specific value, and what is returned is true or false rather than the data itself.

Answer 46

this still uses existential semantics, so for $s1/name = $s2/name, if both sides can find the same name, then it returns true. If one of the two sides or neither has a matching name, the condition is false and nothing is returned for that pair.
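A sketch of this comparison rule in plain Python, not XQuery. The helper name general_equals and the name lists are invented for illustration; the point is only that the test succeeds as soon as one pair matches.

```python
def general_equals(left, right):
    """True if at least one item on the left equals at least one item on the right."""
    return any(a == b for a in left for b in right)


s1_names = ["Ann", "Annie"]   # stands in for the names found by $s1/name
s2_names = ["Annie"]          # stands in for the names found by $s2/name
s3_names = []                 # a student with no name element at all

same_name_found = general_equals(s1_names, s2_names)
no_name_to_match = general_equals(s1_names, s3_names)
```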

Answer 47

- We can also use some to say: return true if at least one of the values bound to the $variable in the XQuery expression satisfies the condition

Answer 48

- We can also use every to say: return true only if every value bound to the $variable in the XQuery expression satisfies the condition
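A Python sketch of the two quantifiers using any() and all(), assuming xml.etree.ElementTree and an invented document of student marks. This is an analogue, not XQuery syntax.

```python
import xml.etree.ElementTree as ET

doc = """
<students>
  <student><name>Ann</name><mark>70</mark><mark>45</mark></student>
  <student><name>Bob</name><mark>80</mark><mark>65</mark></student>
  <student><name>Cat</name></student>
</students>
"""
root = ET.fromstring(doc)


def marks(student):
    """All marks of one student as integers."""
    return [int(m.text) for m in student.findall("mark")]


# some: at least one mark above 60.
some_above_60 = [s.findtext("name") for s in root.findall("student")
                 if any(m > 60 for m in marks(s))]

# every: all marks above 60 (vacuously true when there are no marks).
every_above_60 = [s.findtext("name") for s in root.findall("student")
                  if all(m > 60 for m in marks(s))]
```

Note the edge case: Cat has no marks, so the every-style test is true for her while the some-style test is false.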

Answer 49

We can fix this by writing the keyword as now this is only returning one item (similar to returning a tuple in Java) - and by putting curly brackets around each item that you want to return and a comma between them

Answer 50

We want to return each title and the authors separately if there was more than one, so we use nested loops. Here we loop over the books, then loop over the authors for each book, and then output the pair of the title and the name.

Answer 51

We can also use order by, like in SQL. We insert the order by clause just above the return clause and put the keyword ascending or descending after it
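A Python analogue of ordering before returning, using sorted(). The book titles and prices are invented for illustration.

```python
# (title, price) pairs standing in for book elements.
books = [("XML Basics", 30.0), ("Databases", 45.5), ("NoSQL", 25.0)]

# order by price ascending, then return the title.
titles_ascending = [title for title, price in sorted(books, key=lambda b: b[1])]

# order by price descending, then return the title.
titles_descending = [title for title, price in
                     sorted(books, key=lambda b: b[1], reverse=True)]
```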

Answer 52

We have group by underneath the original where clause. Just like in SQL, we can also have a where clause after group by. This is known as HAVING in SQL (it does not affect GROUP BY and filters the results after group by was performed on them)

Answer 53

In XQuery, it’s a function, called distinct-values. Before it outputs the distinct values, it also converts elements to strings

Answer 54

- So we wrap the distinct values around everything and this will give us the desired output. - so we write it before like distinct-values(let $doc:= "mydoc.xml"...)
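A Python analogue of this, assuming xml.etree.ElementTree and an invented books document. The elements are first turned into strings via .text, then duplicates are dropped with dict.fromkeys, which keeps the first occurrence of each value.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<books>"
    "<book><author>Ann</author></book>"
    "<book><author>Bob</author></book>"
    "<book><author>Ann</author></book>"
    "</books>"
)

# Convert each author element to its string value, then remove duplicates.
author_strings = [a.text for a in root.findall("book/author")]
distinct_authors = list(dict.fromkeys(author_strings))
```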

Answer 55

- Store XML documents as entries of a table - Store XML documents in schema-independent form - Store XML documents in shredded form across a number of attributes and relations

Answer 56

- So we create a table XMLStaff with 3 attributes: doc number, doc date and some staff data.
- Then we can insert into XMLStaff some values for the doc number and doc date. When we want to insert the XML data, we write XML(‘’) and put whatever the XML data is inside the quotation marks, like a string.
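A sketch of storing a document as a table entry, assuming Python's sqlite3 module. SQLite has no XML column type, so the XML is stored as TEXT here, which is an assumption of this example rather than the XML(...) syntax above. The column names and values are invented for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")

# Three attributes: document number, document date and the XML itself (as TEXT).
conn.execute(
    "CREATE TABLE XMLStaff (docNo INTEGER PRIMARY KEY, docDate TEXT, staffData TEXT)"
)

# The XML document is passed in like a string.
conn.execute(
    "INSERT INTO XMLStaff VALUES (?, ?, ?)",
    (1, "2024-01-15", "<staff><name>Ann</name><dept>CS</dept></staff>"),
)

# Read the entry back and look inside the XML with an XML parser.
row = conn.execute("SELECT staffData FROM XMLStaff WHERE docNo = ?", (1,)).fetchone()
staff_name = ET.fromstring(row[0]).findtext("name")
conn.close()
```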

Answer 57

- Store the xml file as a tree inside our database. - Since XML is a tree structure, each node may have only one parent. - The rootID attribute allows a query on a particular node to be linked back to its document node. - While this doesn’t depend on a schema, the recursive nature of structure can cause performance problems when searching for specific paths. - To overcome this, we create a denormalised index (table) containing combinations of path expressions and a link to node and parent node.
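A sketch of the tree-as-table idea, assuming xml.etree.ElementTree. The function name to_node_table and the row layout (node_id, parent_id, root_id, tag, text) are assumptions for illustration, not a fixed schema.

```python
import xml.etree.ElementTree as ET


def to_node_table(root, root_id):
    """Flatten an element tree into (node_id, parent_id, root_id, tag, text) rows."""
    rows = []

    def visit(elem, parent_id):
        node_id = len(rows) + 1  # ids are handed out in document order
        rows.append((node_id, parent_id, root_id, elem.tag, (elem.text or "").strip()))
        for child in elem:
            visit(child, node_id)  # each node records exactly one parent

    visit(root, None)
    return rows


rows = to_node_table(
    ET.fromstring("<staff><name>Ann</name><dept>CS</dept></staff>"), root_id=1
)
```

Finding a path such as staff/name in this form means following parent_id links row by row, which is the recursive cost the notes mention.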

Answer 58

- extract all the data from your XML document and then put it into your database by spreading it over a number of attributes in one or more relations
- this may make it easier to index values of some elements, provided these elements are placed into their own attributes
- it is also possible to add some additional data relating to the hierarchical nature of the XML, making it possible to recompose the original structure and ordering, and to allow the XML to be updated
- however, with this approach you also have to create an appropriate database structure

Answer 59

It doesn’t mean there’s no SQL involved, it just means that it’s not entirely SQL or relational database

Answer 60

just collect all the data together for each user - whereas secure data needs to be stored in reliable relational databases

Answer 61

Designed to guarantee:
- Every non-failing node should always be accepting new queries (shouldn’t lock)
- Not the same version of consistency as in ACID
- Add scalability by just adding new computers into the network
- We want to achieve high performance by just doing simple transactions, such as look-ups based on keys, and only allowing insertion of keys with the corresponding values. Such simple processes can be done even faster than in relational databases

Answer 62

- Partition tolerance means that if a connection between nodes fails or a bunch of nodes fail, the remaining subnetworks can still operate despite missing information from another/multiple nodes

Answer 63

- we cannot achieve all three of these properties simultaneously - You can only select two of these properties at once

Answer 64

- Basically available states that you should be able to answer queries nearly all the time as long as computers are running. - Soft state and eventually consistent mean that a database state might occasionally be inconsistent. So the result you get in one place may be different to the result you get from another place because they might not have all information available at all nodes but eventually the information and the system will be made consistent.

Answer 65

- Key value stores: only let you store key and value pairs, so you can find the keys very fast. No other operations can be done apart from inserting and searching for keys.
- Document stores: do the same as a key value store, except here the value will typically be some semi-structured data, and queries can involve look-up of data that is stored next to a key
- Column stores
- Graph databases

Answer 66

- the simplest kind of database system - a collection of tables and each table consists of key value pairs - In essence it’s just one big index

Answer 67

- you store each key value pair at some node (a database in your distributed storage), so the pairs are spread over the different computers
- you use a hash function (a function that for the same input always gives the same output) to map each key to an integer between 0 and 2^n - 1, where n is some sufficiently large number, so you can space out your nodes
- after determining a hash function you distribute these integers to the nodes (computers)
- you can have multiple versions of the same node in many places
- so in the diagram above, we have 3 copies storing A, 3 copies storing B and only 2 copies storing C, which could mean computer C is less powerful than A and B
- then a key value pair whose key hashes to integer i is stored at the node that comes next after i
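A minimal sketch of this scheme, assuming SHA-256 as the hash function and a ring of 2^16 positions. The class name HashRing, the node positions and the node names A, B, C are invented for illustration and do not reproduce the original diagram.

```python
import bisect
import hashlib

RING_BITS = 16  # positions run from 0 to 2**16 - 1 (an arbitrary choice)


def ring_position(key: str) -> int:
    """Map a key to an integer on the ring; same input always gives same output."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (2 ** RING_BITS)


class HashRing:
    def __init__(self, placements):
        # placements: dict mapping ring position -> node name.
        # A node may appear at several positions.
        self.placements = dict(placements)
        self.positions = sorted(self.placements)

    def node_for_position(self, i: int) -> str:
        # The first node position at or after i, wrapping around the ring.
        idx = bisect.bisect_left(self.positions, i) % len(self.positions)
        return self.placements[self.positions[idx]]

    def node_for_key(self, key: str) -> str:
        return self.node_for_position(ring_position(key))


# A and B each sit at two positions, C only at one.
ring = HashRing({100: "A", 20000: "B", 40000: "C", 50000: "A", 60000: "B"})
owner = ring.node_for_key("student:42")
```

Giving a node more positions on the ring gives it a larger share of the keys, which is one way to account for more powerful machines.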

Answer 68

- Scalability is fairly simple: if you need more power, just add in a new computer at any location in the cycle and move the key value pairs around appropriately to balance out the number of pairs at each node

Answer 69

- We just store replicas (backups) on consecutive(next to each other) nodes in clockwise order. If you replicate something twice, it will be stored in the following node and the node following the following node. So if the following node fails, we can get it from the one after.
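A sketch of placing replicas on the following nodes in clockwise order. The node names, the helper names and the failure set are invented for illustration.

```python
# Nodes listed in clockwise order around the ring.
nodes_clockwise = ["A", "B", "C", "D"]


def replica_nodes(primary_index: int, replicas: int):
    """The primary node followed by the next `replicas` nodes clockwise."""
    n = len(nodes_clockwise)
    return [nodes_clockwise[(primary_index + k) % n] for k in range(replicas + 1)]


def first_available(candidates, failed):
    """Read from the first candidate that has not failed, or None if all failed."""
    for node in candidates:
        if node not in failed:
            return node
    return None


# An item whose primary is C, replicated twice: stored on C, D and (wrapping) A.
holders = replica_nodes(primary_index=2, replicas=2)

# If C and D are both down, the copy on A can still be read.
reader = first_available(holders, failed={"C", "D"})
```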

Answer 70

- allows multiple versions of a data item
- if a newer version of a data item is not yet available because it hasn’t been stored on all of the replicas yet, you get the most recent available version
- this is typically fine for the majority of applications, but if it’s not, we assign a vector clock to each version of an item X, which is a list of nodes and timestamps
- the node corresponds to where the version is stored, and the timestamp is the local time on that node when the item was written
- we use the vector clocks to decide if one version originated from another: this is the case if every timestamp in the older version’s vector clock is smaller than or equal to the corresponding timestamp in the newer version’s vector clock
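A sketch of the vector clock comparison, assuming each clock is a dict from node name to timestamp and that a missing node counts as timestamp 0. The function name descends_from and the example clocks are invented for illustration.

```python
def descends_from(newer: dict, older: dict) -> bool:
    """True if every timestamp in `older` is <= the matching timestamp in `newer`."""
    return all(newer.get(node, 0) >= ts for node, ts in older.items())


v1 = {"A": 1}            # written once on node A
v2 = {"A": 2, "B": 1}    # later update seen by A and B
v3 = {"A": 1, "C": 1}    # a different update, made on C from v1

v2_from_v1 = descends_from(v2, v1)   # v2 originated from v1
v3_from_v2 = descends_from(v3, v2)   # v3 did not originate from v2
v2_from_v3 = descends_from(v2, v3)   # and v2 did not originate from v3

# Neither descends from the other, so v2 and v3 are conflicting versions.
conflict = not v3_from_v2 and not v2_from_v3
```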

Answer 71

- Databases that store collections of “documents” - the object ID is typically generated in some way. Could just be incremental. - the document value part is typically written in JSON (think of it as XML though)

Answer 72

JSON is shorter

Answer 73

- When a user asks about a restaurant they want information about that restaurant but not about other restaurants - Makes sense to use it in this case because the amount and what type of information you want varies from restaurant to restaurant

Answer 74

If you want to update it, you do it only on the master. Whenever you want to do a read, you by default ask the master.

Answer 75

The way you do this is by splitting your collection into horizontal fragments; then, based on the shard key (an indexed field that exists in all documents), you distribute them over different nodes. For replication you copy entire fragments to other nodes using a master/slave approach, where you have one primary copy on the master and a bunch of secondary copies (the replicas). You only have to instruct the master to update, and this node will instruct the replicas to update.
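A toy sketch of sharding plus master/replica copies in plain Python. It is not the API of any real document store: the class ReplicatedShard, the use of CRC32 to pick a shard, the field "city" as shard key and the synchronous copying to replicas are all simplifying assumptions.

```python
import zlib


class ReplicatedShard:
    """One horizontal fragment: a primary copy plus secondary copies."""

    def __init__(self, secondaries: int = 2):
        self.primary = {}
        self.secondaries = [{} for _ in range(secondaries)]

    def write(self, doc_id, document):
        # Only the master is told to update; it then updates the replicas.
        self.primary[doc_id] = document
        for copy in self.secondaries:
            copy[doc_id] = dict(document)

    def read(self, doc_id):
        # Reads go to the master by default.
        return self.primary.get(doc_id)


shards = [ReplicatedShard() for _ in range(3)]


def shard_for(shard_key_value: str) -> int:
    """Pick a shard from the shard key value (deterministic hash)."""
    return zlib.crc32(shard_key_value.encode("utf-8")) % len(shards)


def insert(document):
    shards[shard_for(document["city"])].write(document["_id"], document)


insert({"_id": 1, "city": "Leeds", "name": "Cafe One"})
insert({"_id": 2, "city": "York", "name": "Bistro Two"})

leeds_shard = shards[shard_for("Leeds")]
found = leeds_shard.read(1)
```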

Answer 76

- used in vertical fragmentation to be able to natural join all of the fragments back together - it's the attribute that all the fragments have to include with their data

Answer 77

- You have multiple levels of columns; from an abstract POV it looks similar to a table in SQL
- We have a column qualifier, which is the name of the column
- How it differs from typical tables is that we have column families: you don’t need to specify all of the column qualifiers in a column family, some can be left empty
- Different rows can have different qualifiers
- Cells just hold values
- To reference an item you specify the column family name, then the column qualifier
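A sketch of this data model as nested Python dictionaries: row key, then column family, then column qualifier, then the cell value. The family names, qualifiers and values are invented for illustration.

```python
# table[row_key][column_family][column_qualifier] = cell value
table = {
    "row1": {
        "personal": {"name": "Ann", "city": "Leeds"},
        "course": {"module": "COMP101"},
    },
    "row2": {
        # Different rows can use different qualifiers; missing ones are just absent.
        "personal": {"name": "Bob"},
        "course": {},
    },
}


def get_cell(row_key, family, qualifier):
    """Look up one cell by column family name and then column qualifier."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)


ann_city = get_cell("row1", "personal", "city")
bob_city = get_cell("row2", "personal", "city")  # qualifier left empty for this row
```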

Answer 78

Scan is used for going through the whole document (table)

Answer 79

- uses two levels of fragmentation: a top and a bottom level
- top level: horizontal fragmentation, where the rows of a table are split into regions
- bottom level: vertical fragmentation, where regions store different column families (subsets of the columns) on different nodes

Answer 80

- Store the data as a graph, it’s different to XML and JSON trees - there are no requirements: such as a root node, they can all link to each other, this graph doesn’t have to be a nice tree - Graphs are slower than other tools