
Chapter 11 | System Design and Memory Limits

11.1 If you were integrating a feed of end of day stock price information (open, high, low, and closing price) for 5,000 companies, how would you do it? You are responsible for the development, rollout and ongoing monitoring and maintenance of the feed. Describe the different methods you considered and why you would recommend your approach. The feed is delivered once per trading day in a comma-separated format via an FTP site. The feed will be used by 1000 daily users in a web application.

SOLUTION
Let’s assume we have some scripts which are scheduled to get the data via FTP at the end of the day. Where do we store the data? How do we store the data in such a way that we can do various analyses of it?
Proposal 1
Keep the data in text files. This would be very difficult to manage and update, as well as very hard to query. Keeping unorganized text files would lead to a very inefficient data model.

Proposal 2
We could use a database. This provides the following benefits:

  • Logical storage of the data.
  • Easy query processing over the data.
    Example: return all stocks having open > N AND closing price < M (see the SQL sketch below).
  • Easy maintenance once installed properly.
  • Rollback, backup, and security can be provided using standard database features. We don’t have to “reinvent the wheel.”
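
For illustration, a minimal SQL sketch of such a schema and query; the table and column names are hypothetical, not part of the original problem:

 -- Hypothetical end-of-day price table.
 CREATE TABLE daily_prices (
     company_name  VARCHAR(100),
     trade_date    DATE,
     open_price    DECIMAL(10, 2),
     high_price    DECIMAL(10, 2),
     low_price     DECIMAL(10, 2),
     closing_price DECIMAL(10, 2),
     PRIMARY KEY (company_name, trade_date)
 );

 -- "Return all stocks having open > N AND closing price < M" for one day.
 SELECT company_name
 FROM daily_prices
 WHERE trade_date = '2008-10-12'
   AND open_price > 100.00      -- N
   AND closing_price < 120.00;  -- M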

Proposal 3
If requirements are not that broad and we just want to do a simple analysis and distribute the data, then XML could be another good option.
Our data has fixed format and fixed size: company_name, open, high, low, closing price. The XML could look like this:

 <root>
   <date value="2008-10-12">
     <company name="foo">
       <open>126.23</open>
       <high>130.27</high>
       <low>122.83</low>
       <closingPrice>127.30</closingPrice>
     </company>
     <company name="bar">
       <open>52.73</open>
       <high>60.27</high>
       <low>50.29</low>
       <closingPrice>54.91</closingPrice>
     </company>
   </date>
   <date value="2008-10-11"> . . . </date>
 </root>

Benefits:

  • Very easy to distribute. This is one reason that XML is a standard data model for sharing and distributing data.
  • Efficient parsers are available to parse the data and extract only the desired data (see the parsing sketch below).
  • We can add new data to the XML file by carefully appending data. We would not have to re-query the database.

However, querying the data could be difficult.
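
As a rough illustration of the parsing point above, here is a minimal sketch using Java’s built-in DOM parser. The file name follows the example XML, and error handling is omitted:

 import java.io.File;
 import javax.xml.parsers.DocumentBuilder;
 import javax.xml.parsers.DocumentBuilderFactory;
 import org.w3c.dom.Document;
 import org.w3c.dom.Element;
 import org.w3c.dom.NodeList;

 public class FeedParser {
     public static void main(String[] args) throws Exception {
         DocumentBuilder builder =
             DocumentBuilderFactory.newInstance().newDocumentBuilder();
         Document doc = builder.parse(new File("feed.xml"));

         // Walk every <company> element and pull out only the closing price.
         NodeList companies = doc.getElementsByTagName("company");
         for (int i = 0; i < companies.getLength(); i++) {
             Element company = (Element) companies.item(i);
             String name = company.getAttribute("name");
             String close = company.getElementsByTagName("closingPrice")
                                   .item(0).getTextContent();
             System.out.println(name + " closed at " + close);
         }
     }
 }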

11.2 How would you design the data structures for a very large social network (Facebook, LinkedIn, etc)? Describe how you would design an algorithm to show the connection, or path, between two people (e.g., Me -> Bob -> Susan -> Jason -> You).

SOLUTION
Approach:
Forget that we’re dealing with millions of users at first. Design this for the simple case.
We can construct a graph by treating every person as a node and drawing an edge between two nodes when the two people are friends with each other.

 class Person {
     Person[] friends;
     // Other info
 }

If I want to find the connection between two people, I would start with one person and do a simple breadth first search.
But... oh no! Millions of users!
When we deal with a service the size of Orkut or Facebook, we cannot possibly keep all of our data on one machine. That means that our simple Person data structure from above doesn’t quite work—our friends may not live on the same machine as us. Instead, we can replace our list of friends with a list of their IDs, and traverse as follows:

  1. For each friend ID: int machine_index = lookupMachineForUserID(id);
  2. Go to machine machine_index.
  3. Person friend = lookupFriend(machine_index, id);

There are more optimizations and follow-up questions here than we could possibly discuss, but here are just a few thoughts.

Optimization: Reduce Machine Jumps
Jumping from one machine to another is expensive. Instead of randomly jumping from machine to machine with each friend, try to batch these jumps; e.g., if 5 of my friends live on one machine, I should look them up all at once.

Optimization: Smart Division of People and Machines
People are much more likely to be friends with people who live in the same country as them. Rather than randomly dividing people up across machines, try to divvy them up by country, city, state, etc. This will reduce the number of jumps.

Question: Breadth-first search usually requires “marking” a node as visited. How do you do that in this case?
Usually, in BFS, we mark a node as visited by setting a visited flag in its node class. Here, we don’t want to do that: there could be multiple searches going on at the same time, so it’s bad to just edit our data. Instead, we can mimic the marking of nodes with a hash table that maps a node ID to whether it has been visited (a sketch of such a search follows the code below).

Other Follow-Up Questions:

  • In the real world, servers fail. How does this affect you?
  • How could you take advantage of caching?
  • Do you search until the end of the graph (infinite)? How do you decide when to give up?
  • In real life, some people have more friends of friends than others, and are therefore more likely to make a path between you and someone else. How could you use this data to pick where to start traversing?

The following code demonstrates our algorithm:
 public class Server {
     ArrayList<Machine> machines = new ArrayList<Machine>();
 }

 public class Machine {
     public ArrayList<Person> persons = new ArrayList<Person>();
     public int machineID;
 }

 public class Person {
     private ArrayList<Integer> friends = new ArrayList<Integer>();
     private int ID;
     private int machineID;
     private String info;
     private Server server = new Server();

     public String getInfo() { return info; }
     public void setInfo(String info) {
         this.info = info;
     }

     public int[] getFriends() {
         int[] temp = new int[friends.size()];
         for (int i = 0; i < temp.length; i++) {
             temp[i] = friends.get(i);
         }
         return temp;
     }

     public int getID() { return ID; }
     public int getMachineID() { return machineID; }
     public void addFriend(int id) { friends.add(id); }

     // Look up a person given their ID and machine ID.
     public Person lookUpFriend(int machineID, int ID) {
         for (Machine m : server.machines) {
             if (m.machineID == machineID) {
                 for (Person p : m.persons) {
                     if (p.ID == ID) {
                         return p;
                     }
                 }
             }
         }
         return null;
     }

     // Look up a machine given the machine ID.
     public Machine lookUpMachine(int machineID) {
         for (Machine m : server.machines) {
             if (m.machineID == machineID)
                 return m;
         }
         return null;
     }

     public Person(int iD, int machineID) {
         ID = iD;
         this.machineID = machineID;
     }
 }
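
Tying back to the “marking” question above, here is a minimal sketch of the search itself, using a HashSet of user IDs instead of visited flags; findPersonByID is a hypothetical lookup standing in for the machine-aware lookups in the code above:

 import java.util.ArrayDeque;
 import java.util.HashMap;
 import java.util.HashSet;
 import java.util.LinkedList;
 import java.util.List;
 import java.util.Queue;

 public class PathFinder {
     // Returns the chain of user IDs from source to destination, or null.
     public static List<Integer> findPath(Person source, int destID) {
         Queue<Integer> queue = new ArrayDeque<Integer>();
         HashSet<Integer> visited = new HashSet<Integer>(); // stands in for "marking"
         HashMap<Integer, Integer> parent = new HashMap<Integer, Integer>();

         queue.add(source.getID());
         visited.add(source.getID());
         while (!queue.isEmpty()) {
             int currentID = queue.remove();
             if (currentID == destID) {
                 // Rebuild the path by walking the parent links backwards.
                 LinkedList<Integer> path = new LinkedList<Integer>();
                 for (Integer id = currentID; id != null; id = parent.get(id)) {
                     path.addFirst(id);
                 }
                 return path;
             }
             Person current = findPersonByID(currentID); // hypothetical lookup
             for (int friendID : current.getFriends()) {
                 if (visited.add(friendID)) { // add() returns false if already visited
                     parent.put(friendID, currentID);
                     queue.add(friendID);
                 }
             }
         }
         return null; // no connection found
     }

     // Hypothetical: would resolve the owning machine and fetch the Person.
     static Person findPersonByID(int id) { throw new UnsupportedOperationException(); }
 }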

11.3 Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB of memory.
FOLLOW UP
What if you have only 10 MB of memory?

SOLUTION
There are a total of 2^32, or 4 billion, distinct integers possible. We have 1 GB of memory, or 8 billion bits.
Thus, with 8 billion bits, we can map all possible integers to a distinct bit with the available memory. The logic is as follows:

  1. Create a bit vector (BV) of size 2^32.
  2. Initialize BV with all 0s.
  3. Scan each number (num) in the file and set BV[num] = 1.
  4. Now scan BV again from the 0th index.
  5. Return the first index which has a 0 value.
 // 2^32 / 8 = 512 MB: one bit per possible value (treats values as non-negative).
 byte[] bitfield = new byte[(int) ((1L << 32) / 8)];

 void findOpenNumber2() throws FileNotFoundException {
     Scanner in = new Scanner(new FileReader("input_file_q11_4.txt"));
     while (in.hasNextInt()) {
         int n = in.nextInt();
         /* Finds the corresponding number in the bitfield by using the
          * OR operator to set the nth bit of a byte (e.g., 10 would
          * correspond to the 2nd bit of index 1 in the byte array). */
         bitfield[n / 8] |= 1 << (n % 8);
     }

     for (int i = 0; i < bitfield.length; i++) {
         for (int j = 0; j < 8; j++) {
             /* Retrieves the individual bits of each byte. When a 0 bit
              * is found, prints the corresponding value. */
             if ((bitfield[i] & (1 << j)) == 0) {
                 System.out.println(i * 8 + j);
                 return;
             }
         }
     }
 }

Follow Up: What if we have only 10 MB memory?
It’s possible to find a missing integer with just two passes of the data set. We can divide the integers into blocks of some size (we’ll discuss how to choose the size later). Let’s assume that we divide the integers into blocks of 1000: block 0 represents the numbers 0 through 999, block 1 represents the numbers 1000 through 1999, and so on. Since the range of ints is finite, we know that the number of blocks needed is finite.
In the first pass, we count how many ints are in each block. That is, if we see 552, we know it is in block 0, so we increment counter[0]. If we see 1425, we know it is in block 1, so we increment counter[1].
At the end of the first pass, we’ll be able to quickly spot a block that is missing a number. If our block size is 1000, then any block which has fewer than 1000 numbers must be missing a number. Pick any one of those blocks.
In the second pass, we’ll actually look for which number is missing. We can do this by creating a simple bit vector of size 1000. We iterate through the file, and for each number that should be in our block, we set the appropriate bit in the bit vector. By the end, we’ll know which number (or numbers) is missing.
Now we just have to decide what the block size is.
A quick answer is 2^20 values per block. We will need an array of 2^12 block counters and a bit vector of 2^17 bytes. Both of these comfortably fit in 10 * 2^20 bytes.
What’s the smallest footprint? It is smallest when the array of block counters occupies the same memory as the bit vector. Let N = 2^32.

counters (bytes): blocks * 4
bit vector (bytes): (N / blocks) / 8
blocks * 4 = (N / blocks) / 8
blocks^2 = N / 32
blocks = sqrt(N / 32) = sqrt(N/2) / 4

At this block count, the counters and the bit vector each occupy sqrt(2) * 2^15 bytes (about 45 KB), so it’s possible to find a missing integer with a total footprint of roughly 91 KB.

 int bitsize = 1048576; // 2^20 bits (2^17 bytes)
 int blockNum = 4096;   // 2^12 blocks of 2^20 values each
 byte[] bitfield = new byte[bitsize / 8];
 int[] blocks = new int[blockNum];

 void findOpenNumber() throws FileNotFoundException {
     int starting = -1;
     Scanner in = new Scanner(new FileReader("input_file_q11_4.txt"));
     while (in.hasNextInt()) {
         int n = in.nextInt();
         blocks[n / (bitfield.length * 8)]++;
     }

     for (int i = 0; i < blocks.length; i++) {
         if (blocks[i] < bitfield.length * 8) {
             /* If the count is less than 2^20, at least one number is
              * missing in that block. */
             starting = i * bitfield.length * 8;
             break;
         }
     }
     if (starting == -1) return; // no incomplete block found

     in = new Scanner(new FileReader("input_file_q11_4.txt"));
     while (in.hasNextInt()) {
         int n = in.nextInt();
         /* If the number is inside the block that is missing numbers,
          * we record it. */
         if (n >= starting && n < starting + bitfield.length * 8) {
             bitfield[(n - starting) / 8] |= 1 << ((n - starting) % 8);
         }
     }

     for (int i = 0; i < bitfield.length; i++) {
         for (int j = 0; j < 8; j++) {
             /* Retrieves the individual bits of each byte. When a 0 bit
              * is found, prints the corresponding value. */
             if ((bitfield[i] & (1 << j)) == 0) {
                 System.out.println(i * 8 + j + starting);
                 return;
             }
         }
     }
 }

11.4 You have an array with all the numbers from 1 to N, where N is at most 32,000. The array may have duplicate entries and you do not know what N is. With only 4KB of memory available, how would you print all duplicate elements in the array?

SOLUTION
We have 4KB of memory, which means we can address up to 8 * 4 * (2^10) = 32,768 bits. Note that 32 * (2^10) bits is greater than 32,000. We can create a bit vector with 32,000 bits, where each bit represents one integer.
NOTE: While this isn’t an especially difficult problem, it’s important to implement this cleanly. We will define our own bit vector class to hold a large bit vector.

 public static void checkDuplicates(int[] array) {
     BitSet bs = new BitSet(32000);
     for (int i = 0; i < array.length; i++) {
         int num = array[i];
         int num0 = num - 1; // bitset starts at 0, numbers start at 1
         if (bs.get(num0)) {
             System.out.println(num);
         } else {
             bs.set(num0);
         }
     }
 }

 class BitSet {
     int[] bitset;

     public BitSet(int size) {
         bitset = new int[size >> 5]; // divide by 32
     }

     boolean get(int pos) {
         int wordNumber = (pos >> 5);  // divide by 32
         int bitNumber = (pos & 0x1F); // mod 32
         return (bitset[wordNumber] & (1 << bitNumber)) != 0;
     }

     void set(int pos) {
         int wordNumber = (pos >> 5);  // divide by 32
         int bitNumber = (pos & 0x1F); // mod 32
         bitset[wordNumber] |= 1 << bitNumber;
     }
 }
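
A quick usage sketch (the input array is hypothetical, not from the original); each duplicate is printed the second time it is seen:

 int[] array = { 1, 5, 1, 10, 12, 10 };
 checkDuplicates(array); // prints 1, then 10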

11.5 If you were designing a web crawler, how would you avoid getting into infinite loops?

SOLUTION
First, how does the crawler get into a loop? The answer is very simple: when we re-parse an already parsed page. This would mean that we revisit all the links found in that page, and this would continue in a circular fashion.
Be careful about what the interviewer considers the “same” page. Is it URL or content? One could easily get redirected to a previously crawled page.
So how do we stop visiting an already visited page? The web is a graph-based structure, and we commonly use DFS (depth first search) and BFS (breadth first search) for traversing graphs. We can mark already visited pages the same way that we would in a BFS/DFS.
We can easily prove that this algorithm terminates. Each step parses only new pages, not already visited pages. So, if we have N unvisited pages, every step reduces that number by one (from N to N - 1), and the algorithm finishes after at most N steps.
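
A minimal sketch of such a crawl loop, using a HashSet of URLs as the “visited” marks; fetchPage and extractLinks are hypothetical helpers, and a real crawler would also normalize URLs (or hash page content) so that redirects to an already-crawled page are caught:

 import java.util.ArrayDeque;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Queue;

 public class Crawler {
     public static void crawl(String seedUrl) {
         Queue<String> frontier = new ArrayDeque<String>();
         HashSet<String> visited = new HashSet<String>(); // pages already parsed

         frontier.add(seedUrl);
         visited.add(seedUrl);
         while (!frontier.isEmpty()) {
             String url = frontier.remove();
             String html = fetchPage(url);            // hypothetical HTTP fetch
             for (String link : extractLinks(html)) { // hypothetical link extraction
                 if (visited.add(link)) { // false if already seen: no loop
                     frontier.add(link);
                 }
             }
         }
     }

     // Hypothetical helpers.
     static String fetchPage(String url) { return ""; }
     static List<String> extractLinks(String html) { return java.util.Collections.emptyList(); }
 }
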
SUGGESTIONS AND OBSERVATIONS

  • This question has a lot of ambiguity. Ask clarifying questions!
  • Be prepared to answer questions about coverage.
  • What kind of pages will you hit with a DFS versus a BFS?
  • What will you do when your crawler runs into a honey pot that generates an infinite subgraph for you to wander about?

11.6 You have a billion urls, where each is a huge page. How do you detect the duplicate documents?

SOLUTION
Observations:

  1. Pages are huge, so bringing all of them into memory is costly. We need a shorter representation of the pages in memory. A hash is an obvious choice for this.
  2. A billion urls exist, so we don’t want to compare every page with every other page (that would be O(n^2)).

Based on these two observations, we can derive the following algorithm:

  1. Iterate through the pages and compute the hash of each one.
  2. Check if the hash value is in the hash table. If it is, throw out the url as a duplicate. If it is not, keep the url and insert it into the hash table.

This algorithm will give us the list of unique urls. But wait, can this fit on one computer?

  • How much space does each page take up in the hash table?
    • Each page hashes to a four byte value.
    • Each url is an average of 30 characters, so that’s another 30 bytes at least.
    • Each url therefore takes up roughly 34 bytes.
  • 34 bytes * 1 billion = 31.6 gigabytes. We’re going to have trouble holding that all in memory!

What do we do?

  • We could split this up into files. We’d have to deal with the file loading and unloading. Ugh.
  • We could hash to disk. Size wouldn’t be a problem, but access time might: a hash table on disk would require a random access read for each check and a write to store each viewed url. Each could take milliseconds waiting for seek and rotational latencies. Elevator algorithms could eliminate the random bouncing from track to track.
  • Or, we could split this up across machines and deal with network latency. Let’s go with this solution, and assume we have n machines.
    • First, we hash the document to get a hash value v.
    • v % n tells us which machine this document’s hash table lives on.
    • v / n is the key in the hash table on that machine. (A sketch of this arithmetic follows.)
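
A minimal sketch of that sharding arithmetic, assuming n = 100 machines and a hypothetical documentHash helper (both are illustrative assumptions):

 public class DedupShard {
     static final int N_MACHINES = 100; // assumed cluster size

     // Route a document to a machine and a key within that machine's table.
     public static void route(String document) {
         long v = documentHash(document); // hypothetical content hash
         long machine = v % N_MACHINES;   // which machine holds this entry
         long key = v / N_MACHINES;       // key within that machine's table
         System.out.println("machine " + machine + ", key " + key);
     }

     // Hypothetical stand-in; a real system would use a stronger content hash.
     static long documentHash(String document) {
         return document.hashCode() & 0xFFFFFFFFL; // non-negative 32-bit value
     }
 }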

11.7 You have to design a database that can store terabytes of data. It should support efficient range queries. How would you do it?

SOLUTION
Construct an index for each field that requires range queries. Use a B+ tree to implement the index. A B+ tree organizes sorted data for efficient insertion, retrieval and removal of records. Each record is identified by a key (for this problem, it is the field value). Since it is a dynamic, multilevel index, finding the beginning of the range depends only on the height of the tree, which is usually quite small. Record references are stored in the leaves, sorted by the key. Additional records can be found by following a next block reference. Records will be sequentially available until the key value reaches the maximum value specified in the query. Thus, runtimes will be dominated by the number of elements in a range.
Avoid using trees that store data at interior nodes, as traversing the tree will be expensive when the tree is not resident in memory.
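
The answer above concerns on-disk B+ tree indexes. As a rough in-memory analogue of the same access pattern (find the start of the range in O(log n), then scan sequentially), here is a sketch using Java’s TreeMap, which is a red-black tree rather than a B+ tree:

 import java.util.Map;
 import java.util.TreeMap;

 public class RangeIndex {
     public static void main(String[] args) {
         // Index: field value -> record reference (row id).
         TreeMap<Integer, String> index = new TreeMap<Integer, String>();
         index.put(10, "row-17");
         index.put(25, "row-3");
         index.put(42, "row-88");
         index.put(60, "row-5");

         // Range query: all records with 20 <= key <= 50.
         // Finding the start costs O(log n); the rest is a sequential scan.
         for (Map.Entry<Integer, String> e : index.subMap(20, true, 50, true).entrySet()) {
             System.out.println(e.getKey() + " -> " + e.getValue());
         }
     }
 }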

Chapter 1 | Arrays & Strings

1.1 Implement an algorithm to determine if a string has all unique characters. What if you can not use additional data structures?
SOLUTION

For simplicity, assume char set is ASCII (if not, we need to increase the storage size. The rest of the logic would be the same). NOTE: This is a great thing to point out to your interviewer!

 public static boolean isUniqueChars2(String str) {
     boolean[] char_set = new boolean[256];
     for (int i = 0; i < str.length(); i++) {
         int val = str.charAt(i);
         if (char_set[val]) return false;
         char_set[val] = true;
     }
     return true;
 }

Time complexity is O(n), where n is the length of the string. Space complexity is O(1): the boolean array has a fixed size of 256 regardless of the input.
We can reduce the constant space usage by using a bit vector. In the code below, we assume the string contains only lowercase ‘a’ through ‘z’, which lets us use just a single int.

 public static boolean isUniqueChars(String str) {
     int checker = 0;
     for (int i = 0; i < str.length(); ++i) {
         int val = str.charAt(i) - 'a';
         if ((checker & (1 << val)) > 0) return false;
         checker |= (1 << val);
     }
     return true;
 }

Alternatively, we could do the following:

  1. Check every char of the string against every other char of the string for duplicate occurrences. This will take O(n^2) time and O(1) space.
  2. If we are allowed to destroy the input string, we could sort the string in O(n log n) time and then linearly check the string for neighboring characters that are identical. Careful, though - many sorting algorithms take up extra space.

1.2 Write code to reverse a C-Style String. (C-String means that “abcd” is represented as five characters, including the null character.)
SOLUTION

This is a classic interview question. The only “gotcha” is to try to do it in place, and to be careful with the null character.

 void reverse(char *str) {
     char *end = str;
     char tmp;
     if (str) {
         while (*end) {
             ++end;
         }
         --end;
         while (str < end) {
             tmp = *str;
             *str++ = *end;
             *end-- = tmp;
         }
     }
 }

1.3 Design an algorithm and write code to remove the duplicate characters in a string without using any additional buffer. NOTE: One or two additional variables are fine. An extra copy of the array is not.
FOLLOW UP
Write the test cases for this method.
_

1.4 Write a method to decide if two strings are anagrams or not.
_

1.5 Write a method to replace all spaces in a string with ‘%20’.
_

1.6 Given an image represented by an NxN matrix, where each pixel in the image is 4 bytes, write a method to rotate the image by 90 degrees. Can you do this in place?
_

1.7 Write an algorithm such that if an element in an MxN matrix is 0, its entire row and column is set to 0.
_

1.8 Assume you have a method isSubstring which checks if one word is a substring of another. Given two strings, s1 and s2, write code to check if s2 is a rotation of s1 using only one call to isSubstring (i.e., “waterbottle” is a rotation of “erbottlewat”).
_

Chapter 17 | Networking

17.1 Explain what happens, step by step, after you type a URL into a browser. Use as much detail as possible.

SOLUTION

There’s no right, or even complete, answer for this question. This question allows you to go into arbitrary amounts of detail depending on what you’re comfortable with. Here’s a start though:

  1. The browser contacts the DNS server to find the IP address of the URL.
  2. DNS returns the IP address of the site.
  3. The browser opens a TCP connection to the web server at port 80.
  4. The browser fetches the HTML of the requested page.
  5. The browser renders the HTML in the display window.
  6. The browser terminates the connection when the window is closed.

One of the most interesting parts is Steps 1 and 2: “Domain Name Resolution.” The web addresses we type are nothing but aliases to IP addresses, in human readable form. The mapping of domain names to their associated Internet Protocol (IP) addresses is managed by the Domain Name System (DNS), which is a distributed but hierarchical entity.

Each domain name server is divided into zones. A single server may only be responsible for knowing the host names and IP addresses for a small subset of a zone, but DNS servers can work together to map all domain names to their IP addresses. That means that if one domain name server is unable to find the IP address of a requested domain, it requests the information from other domain name servers.
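
A minimal sketch of steps 1 through 3 in Java (the host name is illustrative; a real browser would also send more headers, use port 443 for HTTPS, cache DNS results, and so on):

 import java.io.OutputStream;
 import java.net.InetAddress;
 import java.net.Socket;

 public class FetchSketch {
     public static void main(String[] args) throws Exception {
         // Steps 1-2: resolve the host name via DNS.
         InetAddress address = InetAddress.getByName("www.example.com");
         System.out.println("Resolved to " + address.getHostAddress());

         // Step 3: open a TCP connection to port 80.
         Socket socket = new Socket(address, 80);

         // Step 4 (beginning): send a minimal HTTP request.
         OutputStream out = socket.getOutputStream();
         out.write("GET / HTTP/1.1\r\nHost: www.example.com\r\nConnection: close\r\n\r\n"
                   .getBytes("US-ASCII"));
         out.flush();
         socket.close();
     }
 }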

17.2 Explain any common routing protocol in detail. For example: BGP, OSPF, RIP.

SOLUTION

Depending on the reader’s level of understanding, knowledge, interest or career aspirations, he or she may wish to explore beyond what is included here. Wikipedia and other websites are great places to look for a deeper understanding. We will provide only a short summary.
BGP: Border Gateway Protocol
BGP is the core routing protocol of the Internet. “When a BGP router first comes up on the Internet, either for the first time or after being turned off, it establishes connections with the other BGP routers with which it directly communicates. The first thing it does is download the entire routing table of each neighboring router. After that it only exchanges much shorter update messages with other routers.
BGP routers send and receive update messages to indicate a change in the preferred path to reach a computer with a given IP address. If the router decides to update its own routing tables because this new path is better, then it will subsequently propagate this information to all of the other neighboring BGP routers to which it is connected, and they will in turn decide whether to update their own tables and propagate the information further.”
Borrowed from http://www.livinginternet.com/i/iw_route_egp_bgp.htm.
RIP: Routing Information Protocol
“RIP provides the standard IGP protocol for local area networks, and provides great network stability, guaranteeing that if one network connection goes down the network can quickly adapt to send packets through another connection. “
“What makes RIP work is a routing database that stores information on the fastest route from computer to computer, an update process that enables each router to tell other routers which route is the fastest from its point of view, and an update algorithm that enables each router to update its database with the fastest route communicated from neighboring routers.”
Borrowing from http://www.livinginternet.com/i/iw_route_igp_rip.htm.
OSPF: Open Shortest Path First
“Open Shortest Path First (OSPF) is a particularly efficient IGP routing protocol that is faster than RIP, but also more complex.”
The main difference between OSPF and RIP is that RIP only keeps track of the closest router for each destination address, while OSPF keeps track of a complete topological database of all connections in the local network. The OSPF algorithm works as described below.

  • Startup. When a router is turned on, it sends Hello packets to all of its neighbors, receives their Hello packets in return, and establishes routing connections by synchronizing databases with adjacent routers that agree to synchronize.
  • Update. At regular intervals each router sends an update message called its “link state” describing its routing database to all the other routers, so that all routers have the same description of the topology of the local network.
  • Shortest path tree. Each router then calculates a mathematical data structure called a “shortest path tree” that describes the shortest path to each destination address and therefore indicates the closest router to send to for each communication; in other words -- “open shortest path first”.
    See http://www.livinginternet.com/i/iw_route_igp_ospf.htm.

17.3 Compare and contrast the IPv4 and IPv6 protocols.

SOLUTION

IPv4 and IPv6 are Internet protocols applied at the network layer. IPv4 is the most widely used protocol today, and IPv6 is the next-generation protocol for the Internet.

  • IPv4 is the fourth version of the Internet Protocol and uses 32-bit addresses, whereas IPv6 is the next-generation Internet protocol and uses 128-bit addresses.
  • IPv4 allows 4,294,967,296 (2^32) unique addresses, whereas IPv6 can hold about 340 undecillion (2^128, roughly 3.4 x 10^38) unique IP addresses.
  • IPv4 has different class types: A, B, C, D, and E. Class A, Class B, and Class C are the three classes of addresses used on IP networks in common practice. Class D addresses are reserved for multicast. Class E addresses are simply reserved, meaning they should not be used on IP networks (they are used on a limited basis by some research organizations for experimental purposes).
  • IPv6 addresses are broadly classified into three categories:
  1. Unicast addresses: A Unicast address acts as an identifier for a single interface. An IPv6 packet sent to a Unicast address is delivered to the interface identified by that address.
  2. Multicast addresses: A Multicast address acts as an identifier for a group / set of interfaces that may belong to the different nodes. An IPv6 packet delivered to a multicast address is delivered to the multiple interfaces.
  3. Anycast addresses: Anycast addresses act as identifiers for a set of interfaces that may belong to the different nodes. An IPv6 packet destined for an Anycast address is delivered to one of the interfaces identified by the address.
  • IPv4 address notation: 239.255.255.255, 255.255.255.0
  • IPv6 addresses are denoted by eight groups of four hexadecimal digits, separated by colons (see the parsing sketch below).
  • An example of a valid IPv6 address: 2001:cdba:0000:0000:0000:0000:3257:9652
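
As a small illustration of the two notations, Java’s InetAddress can parse both (the addresses are taken from the examples above):

 import java.net.InetAddress;

 public class AddressDemo {
     public static void main(String[] args) throws Exception {
         // Literal addresses are only format-checked; no DNS lookup happens.
         InetAddress v4 = InetAddress.getByName("239.255.255.255");
         InetAddress v6 = InetAddress.getByName("2001:cdba:0000:0000:0000:0000:3257:9652");
         System.out.println(v4.getHostAddress()); // 239.255.255.255
         System.out.println(v6.getHostAddress()); // 2001:cdba:0:0:0:0:3257:9652
     }
 }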
Because of the growth of the Internet, there is a need for the IPv6 protocol, which provides solutions for:
  1. Increased address space
  2. More efficient routing
  3. Reduced management requirement
  4. Improved methods to change ISP
  5. Better mobility support
  6. Multi-homing
  7. Security
  8. Scoped address: link-local, site-local and global-address space

17.4 What is a network / subnet mask? Explain how host A sends a message / packet to host B when: (a) both are on same network and (b) both are on different networks. Explain which layer makes the routing decision and how.

SOLUTION

A mask is a bit pattern used to identify the network/subnet address. The IP address consists of two components: the network address and the host address.
The IP addresses are categorized into different classes which are used to identify the network address.
Example: Consider IP address 152.210.011.002. This address belongs to Class B, so:
Network Mask: 11111111.11111111.00000000.00000000
Given Address: 10011000.11010101.00001011.00000010
By ANDing Network Mask and IP Address, we get the following network address:
10011000.11010101.00000000.00000000 (152.210.0.0)
Host address: 00001011.00000010
Similarly, a network administrator can divide any network into sub-networks by using subnet mask. To do this, we further divide the host address into two or more subnets.
For example, if the above network is divided into 18 subnets (requiring a minimum of 5 bits to represent 18 subnets), the first 5 bits will be used to identify the subnet address.
Subnet Mask: 11111111.11111111.11111000.00000000 (255.255.248.0)
Given Address: 10011000.11010101.00001011.00000010
So, by ANDing the subnet mask and the given address, we get the following subnet address: 10011000.11010101.00001000.00000000 (152.210.8.0)
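
That masking step is a plain bitwise AND. A minimal sketch with the example’s values packed into 32-bit ints:

 public class MaskDemo {
     public static void main(String[] args) {
         int address = toInt(152, 210, 11, 2);
         int subnetMask = toInt(255, 255, 248, 0); // 11111111.11111111.11111000.00000000
         int subnet = address & subnetMask;        // AND gives the subnet address
         System.out.println(toDotted(subnet));     // 152.210.8.0
     }

     static int toInt(int a, int b, int c, int d) {
         return (a << 24) | (b << 16) | (c << 8) | d;
     }

     static String toDotted(int ip) {
         return ((ip >>> 24) & 0xFF) + "." + ((ip >>> 16) & 0xFF) + "."
              + ((ip >>> 8) & 0xFF) + "." + (ip & 0xFF);
     }
 }
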
How Host A sends a message/packet to Host B:
When both are on the same network: the host address bits are used to identify the host within the network.
When they are on different networks: the router uses the network mask to identify the network and routes the packet; the host is then identified using the host address bits.
The network layer is responsible for making routing decisions. A routing table is used to store the path information and the cost involved with that path, while a routing algorithm uses the routing table to decide the path on which to route the packets.
Routing is broadly classified into Static and Dynamic Routing based on whether the table is fixed or it changes based on the current network condition.

17.5 What are the differences between TCP and UDP? Explain how TCP handles reliable delivery (explain ACK mechanism), flow control (explain TCP sender’s / receiver’s window) and congestion control.

SOLUTION

TCP (Transmission Control Protocol): TCP is a connection-oriented protocol. A connection can be made from client to server, and from then on any data can be sent along that connection.

  • Reliable - when you send a message along a TCP socket, you know it will get there unless the connection fails completely. If it gets lost along the way, the server will re-request the lost part. This means complete integrity; data will not get corrupted.
  • Ordered - if you send two messages along a connection, one after the other, you know the first message will get there first. You don’t have to worry about data arriving in the wrong order.
  • Heavyweight - when the low level parts of the TCP “stream” arrive in the wrong order, resend requests have to be sent. All the out of sequence parts must be put back together, which requires a bit of work.
UDP (User Datagram Protocol): UDP is a connectionless protocol. With UDP, you send messages (packets) across the network in chunks.

  • Unreliable - When you send a message, you don’t know if it’ll get there; it could get lost on the way.
  • Not ordered - If you send two messages out, you don’t know what order they’ll arrive in.
  • Lightweight - No ordering of messages, no tracking of connections, etc. It’s just fire and forget! This means it’s a lot quicker, and the network card / OS have to do very little work to translate the data back from the packets.

How TCP handles reliable delivery (the ACK mechanism) and flow control (the TCP sender’s/receiver’s window):
For each TCP packet, the receiver must acknowledge that the packet was received. If there is no acknowledgement, the packet is sent again. This guarantees that every single packet is delivered. An ACK is a packet used in TCP to acknowledge receipt of a packet. A TCP window is the amount of outstanding data (unacknowledged by the recipient) that a sender can send on a particular connection before it must get an acknowledgment back from the receiver.

For example, if a pair of hosts are talking over a TCP connection that has a window size of 64 KB, the sender can send only 64 KB of data and then must wait for an acknowledgment from the receiver that some or all of the data has been received. If the receiver acknowledges that all the data has been received, the sender is free to send another 64 KB. If the sender gets back an acknowledgment that the receiver received only the first 32 KB (which could happen if the second 32 KB is still in transit, or if it was lost), then the sender can send only another 32 KB, since it cannot have more than 64 KB of unacknowledged data outstanding (the second 32 KB of data plus the third).

Congestion Control
TCP uses a network congestion avoidance algorithm that includes various aspects of an additive-increase, multiplicative-decrease scheme, together with other schemes such as slow start, to achieve congestion avoidance. There are different algorithms to solve the problem; Tahoe and Reno are the best known. To avoid congestion collapse, TCP uses a multi-faceted congestion control strategy. For each connection, TCP maintains a congestion window, limiting the total number of unacknowledged packets that may be in transit end-to-end. This is somewhat analogous to TCP’s sliding window used for flow control. TCP uses a mechanism called slow start to increase the congestion window after a connection is initialized and after a timeout. It starts with a window of two times the maximum segment size (MSS). Although the initial rate is low, the rate of increase is very rapid: for every packet acknowledged, the congestion window increases by 1 MSS, so the congestion window doubles every round trip time (RTT). When the congestion window exceeds a threshold, ssthresh, the algorithm enters a new state called congestion avoidance. In some implementations (e.g., Linux), the initial ssthresh is large, and so the first slow start usually ends after a loss. However, ssthresh is updated at the end of each slow start, and will often affect subsequent slow starts triggered by timeouts.

Chapter 15 | Databases

Small Database Design

Imagine you are asked to design a system to represent a large, multi-location, apartment rental company.
What are the key objects?
Property. Building. Apartment. Tenant. Manager.
How do they relate to each other?
Many-to-Many:

  • A property could have multiple managers, and a manager could manage multiple properties.

One-to-Many:

  • A building can only be part of one property.
  • An apartment can only be part of one building.

15.1 Write a method to find the number of employees in each department.

SOLUTION
This problem uses a straightforward join of Departments and Employees. Note that we use a left join instead of an inner join because we want to include Departments with 0 employees; counting Employees.Dept_ID rather than * lets those departments report a count of 0.

 select Dept_Name, Departments.Dept_ID, count(Employees.Dept_ID) as 'num_employees'
 from Departments
 left join Employees
 on Employees.Dept_ID = Departments.Dept_ID
 group by Departments.Dept_ID, Dept_Name

15.2 What are the different types of joins? Please explain how they differ and why certain types are better in certain situations.

SOLUTION
JOIN is used to combine the results of two tables. To perform a join, each of the tables must have at least one field which will be used to find matching records from the other table. The join type defines which records will go into the result set.
Let’s take for example two tables: one table lists “regular” beverages, and another lists the calorie-free beverages. Each table has two fields: the beverage name and its product code. The “code” field will be used to perform the record matching.

Regular Beverages:

 Name        Code
 Budweiser   BUDWEISER
 Coca-Cola   COCACOLA
 Pepsi       PEPSI

Calorie-Free Beverages:

 Code        Name
 COCACOLA    Diet Coca-Cola
 FRESCA      Fresca
 PEPSI       Diet Pepsi
 PEPSI       Pepsi Light
 WATER       Purified Water

Let’s join these tables on the code field. Since the order of the joined tables matters, we will consider the following statement:

[Beverage] JOIN [Calorie-Free Beverage]

i.e. [Beverage] is from the left of the join operator, and [Calorie-Free Beverage] is from the right.
1. INNER JOIN: The result set contains only rows where the criteria match. In our example, we get 3 records: 1 with the COCACOLA code and 2 with the PEPSI code.
2. OUTER JOIN: An OUTER JOIN always contains the results of the INNER JOIN, but it can also contain records that have no matching record in the other table. OUTER JOINs are divided into the following subtypes:
2.1. LEFT OUTER JOIN, or simply LEFT JOIN: The result contains all records from the left table. If no matching record is found in the right table, its fields contain NULL values. In our example, we get 4 records: in addition to the INNER JOIN results, BUDWEISER is listed, because it is in the left table.
2.2. RIGHT OUTER JOIN, or simply RIGHT JOIN: The opposite of LEFT JOIN; the result contains all records from the right table, and missing fields from the left table are NULL. If we have two tables A and B, then A LEFT JOIN B is equivalent to B RIGHT JOIN A.
In our example, we get 5 records: in addition to the INNER JOIN results, the FRESCA and WATER records are listed.
2.3. FULL OUTER JOIN
This type of join combines the results of the LEFT and RIGHT joins. All records from both tables are part of the result set, whether a matching record exists in the other table or not. If no matching record is found, the corresponding result fields have a NULL value.
In our example, we get 6 records. (Concrete SQL for these joins is sketched below.)
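
Concretely, the join statements might look like this, assuming hypothetical tables Beverages and CalorieFreeBeverages, each with Name and Code columns:

 SELECT b.Name AS regular, cf.Name AS calorie_free
 FROM Beverages b
 INNER JOIN CalorieFreeBeverages cf ON b.Code = cf.Code;      -- 3 rows

 SELECT b.Name AS regular, cf.Name AS calorie_free
 FROM Beverages b
 LEFT JOIN CalorieFreeBeverages cf ON b.Code = cf.Code;       -- 4 rows (adds BUDWEISER)

 SELECT b.Name AS regular, cf.Name AS calorie_free
 FROM Beverages b
 FULL OUTER JOIN CalorieFreeBeverages cf ON b.Code = cf.Code; -- 6 rows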

15.3 What is denormalization? Explain the pros and cons.

SOLUTION
Denormalization is the process of attempting to optimize the performance of a database by adding redundant data or by grouping data. In some cases, denormalization helps cover up the inefficiencies inherent in relational database software. A relational normalized database imposes a heavy access load over physical storage of data even if it is well tuned for high performance.
A normalized design will often store different but related pieces of information in separate logical tables (called relations). If these relations are stored physically as separate disk files, completing a database query that draws information from several relations (a join operation) can be slow. If many relations are joined, it may be prohibitively slow. There are two strategies for dealing with this. The preferred method is to keep the logical design normalized, but allow the database management system (DBMS) to store additional redundant information on disk to optimize query response. In this case, it is the DBMS software’s responsibility to ensure that any redundant copies are kept consistent. This method is often implemented in SQL as indexed views (Microsoft SQL Server) or materialized views (Oracle). A view represents information in a format convenient for querying, and the index ensures that queries against the view are optimized.
The more usual approach is to denormalize the logical data design. With care, this can achieve a similar improvement in query response, but at a cost: it is now the database designer’s responsibility to ensure that the denormalized database does not become inconsistent. This is done by creating rules in the database, called constraints, that specify how the redundant copies of information must be kept synchronized. It is the increase in logical complexity of the database design and the added complexity of the additional constraints that make this approach hazardous. Moreover, denormalization introduces a trade-off: it speeds up reads (SELECT in SQL) while slowing down writes (INSERT, UPDATE, and DELETE). This means a denormalized database under heavy write load may actually offer worse performance than its functionally equivalent normalized counterpart.
A denormalized data model is not the same as a data model that has not been normalized, and denormalization should only take place after a satisfactory level of normalization has taken place and that any required constraints and/or rules have been created to deal with the inherent anomalies in the design. For example, all the relations are in third normal form and any relations with join and multivalued dependencies are handled appropriately.
Adapted from Wikipedia.

15.4 Draw an entity-relationship diagram for a database with companies, people, and professionals (people who work for companies).

SOLUTION
People who work for companies are Professionals. So there is an ISA (is a) relationship between People and Professionals (or we could say that a Professional is derived from People).
Each Professional has additional information such as degree, work experiences, etc, in addition to the properties derived from People.
A Professional works for one company at a time, but Companies can hire many Professionals, so there is a Many to One relationship between Professionals and Companies. This “Works For” relationship can store attributes such as date of joining the company, salary, etc. These attributes are only defined when we relate a Professional with a Company.
A Person can have multiple phone numbers, which is why Phone is a multi-valued attribute.

15.5 Imagine a simple database storing information for students’ grades. Design what this database might look like, and provide a SQL query to return a list of the honor roll students (top 10%), sorted by their grade point average.

SOLUTION
In a simplistic database, we’ll have at least these three objects: Students, Courses, and CourseEnrollment. Students will have at least the student name and ID, and will likely have other personal information. Courses will contain the course name and ID, and will likely contain the course description, professor, etc. CourseEnrollment will pair Students and Courses, and will also contain a field for CourseGrade. We will assume that CourseGrade is an integer.
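
A minimal sketch of that schema in SQL (the column types and key choices are assumptions):

 CREATE TABLE Students (
     StudentID   INT PRIMARY KEY,
     StudentName VARCHAR(100)
 );

 CREATE TABLE Courses (
     CourseID   INT PRIMARY KEY,
     CourseName VARCHAR(100)
 );

 CREATE TABLE CourseEnrollment (
     StudentID INT REFERENCES Students(StudentID),
     CourseID  INT REFERENCES Courses(CourseID),
     Grade     INT,
     PRIMARY KEY (StudentID, CourseID)
 );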
Our SQL query to get the list of honor roll students might look like this:

 SELECT StudentName, GPA
 FROM (
     SELECT TOP 10 PERCENT Avg(CourseEnrollment.Grade) AS GPA,
            CourseEnrollment.StudentID
     FROM CourseEnrollment
     GROUP BY CourseEnrollment.StudentID
     ORDER BY Avg(CourseEnrollment.Grade) DESC) Honors -- DESC: highest averages are the top 10%
 INNER JOIN Students ON Honors.StudentID = Students.StudentID
 ORDER BY GPA DESC

This database could get arbitrarily more complicated if we wanted to add in professor information, billing, etc.
