Mark Gritter's Journal

Thursday, September 24th, 2015
10:03 pm
Ambiguity might or might not work in our favor
Tintri is on TechCrunch's "Unicorn Leaderboard" http://techcrunch.com/unicorn-leaderboard/ but at the very bottom. We have not disclosed a valuation, ever. (Also, as of this writing, our sector is listed as "healthcare." Don't believe everything you read on the Internet.)

This table is very amusing to me given the sheer number of companies clustered between $1.0b and $1.03b. There are 25 in that range, compared to 11 in the "emerging unicorn" list for the range $800-999m. Some of these funding round values were set to just cross over into "unicorn" range.

On a tangential note, Pure Storage hopes to be valued at up to $3.33b at IPO.

Here I am with my Titan of Technology award:

The crowd gasped when the MCs stated that Tintri hoped to go IPO next year with a valuation over a billion dollars. It was definitely a Dr. Evil moment, but I was never any good with the pinky finger.

The collection of attendees was a little bit odd. Quite a lot of real estate people. Nobody I saw from any of the banks, or from Target, or the major healthcare players, although Thomson Reuters was a sponsor. Tim had to sit next to some headhunters, but also somebody from Public Radio International. Phil Soran (Compellent, Vidku, etc., a winner last year) came up and introduced himself, and I said hi to Sona Mehring (CaringBridge and the Minne* board), who was awarded this year.
Friday, September 4th, 2015
9:32 am
DFA's aren't easy
A blog entry on "Open Problems That Might Be Easy" mentions this one:
Given a DFA, does it accept a string that is the binary representation of a prime number--- is this decidable?

I think this is sort of a tricky question in that it sounds simple but actually asks for very deep insights. To solve it you might have to know something about prime numbers which is also an open problem!

How hard are DFAs? After all, we have the pumping lemma which helps us prove that languages aren't regular. But, your computer is actually equivalent in power to a DFA (if we disconnect it from the Internet.) It has only finitely many states, and so the languages it can recognize are regular. So imagine this problem as "how can I tell whether a computer program, running on Windows Server 2012, on an eight-core machine with 256GB of memory, ever accepts a prime number?" We're talking an unimaginably large number of states--- does it still seem reasonable that there's a computable way of analyzing its behavior?

But, we do have the pumping lemma. So if we could characterize that there is always a prime number of the form a ## bb...bb ## c, that might answer the question in the affirmative for any DFA for which we could identify its "sufficiently long" strings. But this is obviously false without further qualification--- take a = 1, b = 0, c = 0. If we use a = 1111...1111, b = 1, c = 1 then the answer may depend on the existence of very large Mersenne primes. So it may be possible to write a DFA for which the answer "does it accept any primes" provides a deep number-theoretical result.
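For small machines you can at least go looking by brute force. Here is a minimal sketch, assuming a DFA over the alphabet {0, 1} stored as a transition table; the representation, the function names, and the example machine are all my own illustration. It only searches primes up to a bound, so it can confirm that a DFA accepts some prime but can never establish that it accepts none, which is the hard direction.

```python
from typing import Dict, Optional, Set, Tuple

Transitions = Dict[Tuple[int, str], int]

def dfa_accepts(delta: Transitions, start: int, accepting: Set[int], word: str) -> bool:
    """Run the DFA on word and report whether it ends in an accepting state."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in accepting

def is_prime(n: int) -> bool:
    """Plain trial division; adequate for the small bounds used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def first_accepted_prime(delta: Transitions, start: int, accepting: Set[int],
                         bound: int) -> Optional[int]:
    """Smallest prime p <= bound whose binary form (no leading zeros) is accepted."""
    for p in range(2, bound + 1):
        if is_prime(p) and dfa_accepts(delta, start, accepting, format(p, "b")):
            return p
    return None

# Hypothetical example machine: accepts strings made only of 1s, length >= 2.
# States: 0 = start, 1 = one '1' seen, 2 = two or more '1's (accepting), 3 = dead.
ALL_ONES: Transitions = {
    (0, "0"): 3, (0, "1"): 1,
    (1, "0"): 3, (1, "1"): 2,
    (2, "0"): 3, (2, "1"): 2,
    (3, "0"): 3, (3, "1"): 3,
}

found = first_accepted_prime(ALL_ONES, 0, {2}, bound=100)
print(found)
```

For the all-ones machine the search stops almost immediately, since "11" is 3. The interesting cases are the ones where the bounded search comes back empty and you have no idea whether a larger bound would change that.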
Thursday, August 20th, 2015
11:15 am
Tintri's all-flash array
Tintri made a bunch of product announcements today. We introduced our first all-flash storage appliances, the T5060 and T5080, which run the same Tintri OS as the hybrid models.

We also released Tintri OS 4.0 and Tintri Global Center 2.1 which support the new models, and presented the "Tintri VMstack" converged infrastructure program. Tintri partners are now selling a "pod" consisting of compute, networking, and Tintri storage as a single unit.

Press coverage:

The Register: http://www.theregister.co.uk/2015/08/20/tintri_adds_tincture_of_allflash/

Yellow Bricks (Duncan Epping's blog): http://www.yellow-bricks.com/2015/08/20/tintri-announces-all-flash-storage-device-and-tintri-os-4-0/

Computer Weekly: http://www.computerweekly.com/news/4500252043/Tintri-dives-into-all-flash-storage-with-the-T5000-series

IT Business Edge: http://www.itbusinessedge.com/blogs/it-unmasked/tintri-unveils-all-flash-arrays-optimized-for-virtual-servers.html

Computer World: http://www.computerworld.com/article/2973104/data-storage-solutions/tintri-covers-the-bases-all-flash-and-hybrid-flash-storage.html

Silicon Angle: http://siliconangle.com/blog/2015/08/20/tintri-launches-a-hyperconverged-platform-to-expand-the-appeal-of-its-new-flash-arrays/

vClouds (Marco Broeken's blog): http://www.vclouds.nl/tintri-added-2-new-all-flash-arrays-to-their-portfolio/
Friday, August 14th, 2015
10:25 pm
Speaking of comparables
Pure Storage filed their form S-1 in preparation for an IPO. It contains a few surprises.

The one most remarked upon is that Pure's market share is not quite what it was believed to be. Gartner estimated calendar year 2014 revenue of $276 million, while Pure reported fiscal year 2014 (offset by one month) revenue of $155m. The difference for 2013 was also large, on the order of $75m.

The Gartner analyst has apologized for his company's error. Pure, having entered their quiet period, cannot explain why they decided to let Gartner's mistake pass (they, like other companies Gartner covers, are given an opportunity to correct factual errors.) I haven't worked in analyst relations--- perhaps letting Gartner make errors is standard practice. Certainly I can understand if Pure decided keeping revenue numbers private outweighed getting an accurate representation in Gartner's market share breakdown.

Pure is proposing a dual class stock structure in which existing investors keep control of the company (their class B shares will have 10x the voting rights of the new class A common stock.) Opinions vary on the wisdom of this. I think it's something the market doesn't care about if your company is successful, and cares a lot about if things don't look bright. It's a little arrogant to assume that Pure is in the former category, but that has been Pure's marketing image from day one. :)

Something I *haven't* seen discussed elsewhere is the large payouts to the executive team (and early investors) in the form of stock repurchases.

In November 2013, we repurchased an aggregate of 3,045,634 shares of our outstanding Class B common stock at a purchase price of $6.9315 per share for an aggregate purchase price of $21.1 million, of which 557,842 shares of common stock were repurchased from David Hatfield, our President, for an aggregate price of $3.9 million. Mr. Hatfield was the only executive officer to participate in this tender offer.

In July 2014, we repurchased an aggregate of 3,803,336 shares of our outstanding Class B common stock at a purchase price of $15.7259 per share for an aggregate purchase price of $59.8 million. The following table summarizes our repurchases of common stock from our directors and executive officers in this tender offer.

Name	Shares of Common Stock	Purchase Price
Scott Dietzen(1)	192,051	$3,020,174
John Colgrove(2)	200,000	3,145,180
David Hatfield(3)	1,000,000	15,725,900

(1) Dr. Dietzen is our Chief Executive Officer and a member of our board of directors.
(2) Mr. Colgrove is our Chief Technology Officer and a member of our board of directors.
(3) Mr. Hatfield is our President.

In April 2014, Pure raised $225m. But $60m of that went right back out the door to existing stockholders. (In August 2013, Pure raised $150m, with $21m flowing back out.) Some of this outflow is mitigated by the exercise of stock options.

I can't speak to these individuals' financial situation and whether it made sense from a personal position for them to cash out. But from a company position, this seems excessive compensation for a company that hasn't yet proved itself. Together the three executives took $25.8m in cash out of the company (not counting any salary or bonuses.) In all three cases, they were also loaned money by the company in order to purchase stock or exercise options, and these loans have been repaid, presumably with the proceeds from the stock buyback.

Finally, Pure is growing its revenue rapidly (although, as noted, not as rapidly as had been previously believed--- and EMC is all over it.) But it's losing a lot of money too. Net cash flow from operations in the most recent quarter was -$14m, with an extra -$6.7m in investment cash flow (including capital investment). That's not too bad, although the reported loss was more than $49m. Somebody more versed in accounting than me can probably explain how they managed to pay that much operational cost without a corresponding drop in cash. (It's in there: depreciation, stock-based compensation, and such.) For the fiscal year ending January 2015 they burned through $196m.

In fiscal 2015, Pure spent about $1 in sales and marketing for every $1 in product revenue: $154,836,000 in product sales (not counting support) and $152,320,000 sales and marketing (not counting any G&A or R&D.) EMC will hammer them for this too. It's a "get big fast" strategy which spends a lot of money every quarter trying to make the next quarter's sales even bigger. You can see this when you slice the data quarter by quarter:

4Q2015 sales and marketing: $42,533K
1Q2016 revenue: $74,077K

3Q2015 sales and marketing: $38,224K
4Q2015 revenue: $65,850K

2Q2015 sales and marketing: $46,448K
3Q2015 revenue: $49,189K

1Q2015 sales and marketing: $25,115K
2Q2015 revenue: $34,764K

This may be correct, but it's an expensive strategy and one that doesn't leave a lot of room for error. The return they're getting on a sales and marketing dollar is not consistently high.
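To put a number on "not consistently high", here is a quick sketch that divides each quarter's revenue by the previous quarter's sales and marketing spend, using only the figures listed above. Crediting a whole quarter's revenue to the prior quarter's spend is a crude assumption of mine, not something the S-1 claims.

```python
# (spend quarter, S&M spend in $K, following quarter, revenue in $K),
# copied from the quarter-by-quarter figures above.
PAIRS = [
    ("1Q2015", 25_115, "2Q2015", 34_764),
    ("2Q2015", 46_448, "3Q2015", 49_189),
    ("3Q2015", 38_224, "4Q2015", 65_850),
    ("4Q2015", 42_533, "1Q2016", 74_077),
]

def revenue_per_sm_dollar(spend_k: int, next_revenue_k: int) -> float:
    """Next-quarter revenue earned per dollar of this quarter's S&M spend."""
    return next_revenue_k / spend_k

ratios = {
    spend_q: revenue_per_sm_dollar(spend, revenue)
    for spend_q, spend, _, revenue in PAIRS
}

for spend_q, spend, revenue_q, revenue in PAIRS:
    print(f"{spend_q} S&M -> {revenue_q} revenue: {ratios[spend_q]:.2f}x")
```

On this rough measure the return swings between a little over 1x and roughly 1.7x from one quarter to the next.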
Tuesday, August 11th, 2015
11:44 pm
Fireside financing
Our CFO gave the company an informal chat about our recent financing round today.

One of the interesting things he discussed (and which I feel fine talking about publicly) is comparables. Just like houses are priced based on similar houses in the neighborhood, and CEOs are paid based on what other CEOs are paid, private investments tend to get valued based on what other companies in the same business are worth.

Unfortunately, this causes a little bit of "headwind" for Tintri because our most direct comparable is Nimble. (There's also Violin, about which the less said the better.) None of the other storage startups have gone public.

Nimble Storage's IPO (as NMBL) was in December 2013, and closed that day at $33.93/share. Their peak was $52.74 in February 2014, and ever since they've just been wandering around $25-$30, 50% off the peak. In contrast, the S&P has gone up 17% over NMBL's time on the market.

It's not that they're doing particularly poorly (although they are losing money.) Revenue has continued to increase at a healthy rate, and they meet analyst expectations. But, this means their price/revenue ratio has been on a steep decline too. And that's what gets used for valuation. Combined with NTAP and EMC's woes, this makes investors question whether storage is where they want to put their money.

This is not a serious impediment to Tintri going public, and obviously didn't stop us from landing a sizable investment.
Saturday, August 8th, 2015
12:31 am
Comic Oligopolies
I watched the documentary "Stripped" last night, and enjoyed it. But it was trying to do too many things to really be a good documentary. The part I enjoyed most was hearing the artists talk about their process for "being funny" every day.

Many of the interviewees have created web-based comics, and so there was a longer-than-necessary section on "how do webcomic authors make money." (Like I said, it tried to do too many things.) There was also a fair amount of incomprehension from the older artists: "that's the part I want somebody else to take care of", "how do these kids make money", "I just like it when a bag of money shows up regularly", etc.

Although the comics syndicates do compete with each other, they form an oligopoly. And the gatekeeper function of that oligopoly is, I think, a large part of what kept artists paid. We know people will make comics--- even surprisingly good comics--- for a pittance, and that there are thousands of hopeful cartoonists out there. (One stat was something like 1 in 3500 applications gets accepted by the syndicates.) A gatekeeper can cut off this supply curve, thereby keeping prices (wages) high. You either are the top 1% and earn a decent living, or eventually go do something else with your life.

The web changes this, although in a complicated way because the revenue stream isn't the same. But it means nobody is saying "no" to a comic artist--- they might be a failure in the market, but they aren't getting rejected by an editor. That suggests that the distribution of money will look different, too.
Friday, August 7th, 2015
11:51 am
More press
The Minneapolis-St. Paul Business Journal selected me as one of its 2015 "Titans of Technology": http://www.bizjournals.com/twincities/morning_roundup/2015/08/2015-titans-of-technology-honorees-announced.html

There will be a more in-depth profile (or at least a photo!) later, and a fancy awards lunch in September. It's an honor to appear on the awards list with local leaders such as Clay Collins (of LeadPages) and Sona Mehring (of CaringBridge.) My award is in the category of "people responsible for the creation of breakthrough ideas, processes or products" along with a professor at UMN, Lucy Dunne, who studies wearable computing.
Wednesday, August 5th, 2015
8:53 am
Tintri announced our "Series F" funding round, led by Silver Lake. They do a lot of late-stage and private equity investment--- for example, Silver Lake also announced a $1b investment in Motorola today! It's great to get a vote of confidence from a sophisticated investor with deep pockets.

$125m is a lot. If we were a Minnesota-based company, Tintri would have tripled tech venture funding in the state (about $50m total in Q1 of this year.) But Silicon Valley is having a major funding boom, with $9.1 billion estimated in Q2 this year. It's also a smaller round than some of our competitors--- for example, Pure has raised about $530m total, including a series F of $225m.

Congratulations to everybody at Tintri!
Monday, July 20th, 2015
12:22 am
Fun with Bad Methodology
Here's a Bill Gross talk at TED about why startups fail. Mr. Gross identifies five factors that might influence a startup's success, scores 250 startups on those factors, and analyzes the results.

This is possibly the worst methodology I have seen in a TED talk. Let's look at just the data presented in his slides. I could not find a fuller source for Mr. Gross's claims. (In fact, in his DLD conference talk he admits that he picked just 20 examples to work with.)

I did a multivariate linear regression--- this does not appear to be the analysis he performed. This produces a coefficient for each of the factors:

Idea	 0.021841942
Team	 0.033134009
Plan	 0.047012997
Funding	-0.03324002
Timing	 0.133669929

While this analysis agrees that "Timing" is the most important, it differs from Mr. Gross on what is second. It actually says that the business plan is a better predictor of success. That's strike one--- the same data admits multiple interpretations about importance. Note also that linear regression says more funding is actively harmful.
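For anyone who wants to repeat this kind of fit, here is a minimal sketch of ordinary least squares with an intercept, solved through the normal equations in plain Python. The scores below are synthetic placeholders generated from known weights, not the numbers from Mr. Gross's slides, and the choice to include an intercept is an assumption of the sketch. All function names are made up for illustration.

```python
import random
from typing import Dict, List

FACTORS = ["Idea", "Team", "Plan", "Funding", "Timing"]

def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_linear(scores: List[List[float]], outcome: List[float]) -> Dict[str, float]:
    """Least-squares fit of outcome on the factor scores, with an intercept."""
    design = [[1.0] + list(row) for row in scores]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, outcome)) for i in range(k)]
    return dict(zip(["Intercept"] + FACTORS, solve(xtx, xty)))

# Synthetic placeholder data (NOT the talk's data): 20 companies scored 1-10
# on each factor, with a noise-free outcome built from known weights.
rng = random.Random(0)
scores = [[float(rng.randint(1, 10)) for _ in FACTORS] for _ in range(20)]
weights = [0.02, 0.03, 0.05, -0.03, 0.13]
outcome = [sum(w * s for w, s in zip(weights, row)) for row in scores]

fitted = fit_linear(scores, outcome)
for name, value in fitted.items():
    print(f"{name:9s} {value: .4f}")
```

Because the synthetic outcome has no noise, the fit simply recovers the weights it was built from. With real scores and only 20 rows the coefficients would be far less stable.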

The second methodological fault is taking these numbers at face value in the first place. One might ask: how much of the variation in the dependent variable do these factors explain? For linear regression, the adjusted R^2 value is about 0.74. That means about 25% of the success is explained by some factor not listed here. What checks did Mr. Gross do to validate that his identified factors were the only ones?

While we're still on problems with statistical analysis, consider that the coefficients also come with error bars.

	Lower 95.0%	Upper 95.0%
Idea	-0.082270273	0.125954157
Team	-0.082753104	0.149021121
Plan	-0.049741296	0.143767289
Funding	-0.126226643	0.059746603
Timing	 0.034672747	0.232667111

The error bars are so wide that few useful statements can be made about relative ordering of importance. (Obviously this might change with 230 more data points. But it seems like Mr. Gross was operating on only 20 samples.)

Next let's transition to the data-gathering faults. Mr. Gross's sample of companies is obviously nonrandom. But it is nonrandom in a way that is particularly prone to bias. He lists startup companies that actually got big enough to attract sufficient attention that he'd heard of them! The even mix of success and failure when he's looking to explain the high rate of failure should be a huge red flag.

Suppose we add 15 more failed companies to the list that have an average timing of 7 (slightly better than the sample) but average on the other factors of 5.

Oops, now timing has slipped to third!

Idea	 0.079811441
Team	 0.067029361
Plan	-0.007925792
Funding	 0.033260711
Timing	 0.064723176

Not surprising, because I deliberately set things up that way. But Mr. Gross doesn't know what the "next" set of failures look like either. He has a complete list of IdeaLabs companies, but not the outside world, which is its own bias--- maybe it's Mr. Gross who is prone to timing failures, not the rest of the world!

Picking only large, well-known failures for your analysis is nearly the very definition of survivorship bias.

Finally, the inputs themselves are suspect even where they are complete. Mr. Gross already *knows* whether these companies are successes or failures when he's filling out his numbers. Does Pets.com really deserve a "3" in the idea category, or is that just retrospective thinking? Why are the successful IdeaLabs companies given 6's and 7's for Team, but the unsuccessful ones get two fours and a five? Did Mr. Gross really think he was giving those two ventures the "B" team at the time?

Even the Y axis is suspect! Pets.com had a successful IPO. Uber and AirBnB can't claim to have reached that stage--- they might yet implode. (Unlikely, admittedly, but possible. And their revenue numbers are better than Pets.com ever was.) As an investor, "Did I receive any return on my investment" is the measure of success.

To summarize,
* The data were generated from the opinions of the principal investigator and not subject to any cross-checks from other parties.
* The data exhibit survivorship bias and other selection bias.
* The analysis includes no confidence intervals or other measurements of the meaningfulness of the results.
* The results presented appear highly dependent upon the exact analysis performed, which was not disclosed.

Am I being unfair? Perhaps. This was not an academic conference talk. But if we are to take his advice seriously, then Mr. Gross should show that his analysis was serious too.
Sunday, July 12th, 2015
1:03 am
I'm reading "The Royalist Revolution" which tells a pretty good story about how the U.S. ended up with a strong executive, and how many of our founding fathers revived Royalist arguments. As in, James-and-Charles-seventeenth-century political thought. It's also a book about how ideas have consequences!

But, the sheer amount of pie-in-the-sky idealization on both sides of the monarchist/parliamentarian debate beggars belief, and nobody seems to have called them on it. In some cases this might just be due to lack of relevant experience.

A common theme is that a strong executive (or royal prerogatives) is needed to combat the despotism of sectional and factional interests. To be fair, the Long Parliament was not exactly an inspiring example of how a single-body legislature would behave. But the notion that a king cannot diminish one part of his realm without diminishing himself is pure hogwash, and can only be written by somebody who has not really grasped the notion of "penal colony." Similarly, claims that a chief magistrate / governor would, by benefit of being selected by the whole state, be beholden "to no single faction or segment of society" can be made with a straight face only if you haven't experienced modern political parties. But these claims were made, apparently in good faith, by otherwise intelligent men.

On the other side, the idea that Parliament is able to represent the population "virtually" because it is large and closely resembles the voting public (so how could it harm itself?) has shortcomings that were obvious even at the time.

I am also somewhat surprised to learn, at this late date (considering my Christian-school education), that Paine's "Common Sense" spent a lot of time exploring Biblical condemnation of the office of kingship.
Thursday, July 9th, 2015
12:26 am
I shouldn't get upset by linkbait
Look, I respect skepticism. Thomas the Doubter is a saint not just in spite of his doubt.

But Salon has been having a lot of low-quality articles lately such as "5 good reasons to think that Jesus never existed" and it's beginning to bug me.

I feel like there's a cyclic behavior here in which each generation which feels oppressed by the organized Christianity of its day comes up with lots of reasons (good and bad) why Christianity is probably false. Christianity has done bad things, therefore its roots are probably made up. (If they were not, how could Christianity be responsible for those bad things?)

Some of this leads to useful philosophy (Spinoza!) and historical criticism. Most just gets ignored and the next generation comes up with a different set of reasons. Questioning the foundations of Christianity convinces few Christians to mend their ways. And it's not like we're lacking in a reasonable intellectual foundation for atheism and agnosticism.

This article is particularly bad. The five reasons are:

"1. No first century secular evidence whatsoever exists to support the actuality of Yeshua ben Yosef": Not an argument for his nonexistence. Same can be said for Socrates, if you ignore all the writing influenced by Socrates. (Yes, I realize this point has been debated to death.)

"2. The earliest New Testament writers seem ignorant of the details of Jesus’ life, which become more crystalized [sic] in later texts." An argument for mythologization but not absence. Arguing that Paul didn't know very much about Jesus' life does not imply he placed him in the distant past.

"3. Even the New Testament stories don’t claim to be first-hand accounts." "4. The gospels, our only accounts of a historical Jesus, contradict each other." Certainly the Gospels were written to support already-existing Christianity, not spark its existence. They are not biographies and were not meant to be. If you accept them as texts meant to convey a point, it seems reasonable the authors (and the Holy Spirit, if one is a Christian) have organized events in a thematic manner. Does that imply there were not actual people who did the events portrayed?

"5. Modern scholars who claim to have uncovered the real historical Jesus depict wildly different persons." Of all the "good reasons", this is by far the worst. Attempts to find the historical Jesus often start by simply discarding everything that makes Jesus distinctive (his claims, words, and miracles) so it is not surprising what is left is a nonentity.
Saturday, July 4th, 2015
1:05 am
Stupid NFS Tricks
One of the Tintri SE's asked whether the advice here about tuning NFS was relevant to us: http://docstore.mik.ua/orelly/networking_2ndEd/nfs/ch18_01.htm (It is not.)

Back in the days when CPU cycles and network bits were considered precious, NFS (the Network File System) was often run over UDP. UDP is a connectionless, best-effort "datagram" service. So occasionally packets get lost along the way, or corrupted and dropped (if you didn't disable checksumming due to aforesaid CPU costs.) But UDP doesn't provide any way to tell that this happened. Thus, the next layer up (SUN-RPC) put a transaction ID in each outgoing call, and verified that the corresponding response had arrived. If you don't see a response within a given time window, you assume that the packet got lost and retry.

This causes problems of two sorts. First, NFS itself must be robust against retransmitted packets, because sometimes it was the response that got lost, or you just didn't wait long enough for the file server to do its thing. This is not so easy for operations like "create a file" or "delete a file" where a second attempt would normally result in an error (respectively "a file with that name already exists" and "there's no such file.") So NFS servers started using what's called a "duplicate request cache" which tracked recently-seen operations and if an incoming request matched, the same response was echoed.
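The idea can be sketched in a few lines. This is a toy, assuming requests are identified by just a client name and a transaction ID; real servers match on more than that, and the class, its `handle` method, and the `create` helper are made up for illustration.

```python
from collections import OrderedDict
from typing import Callable, Hashable, Set

class DuplicateRequestCache:
    """Remember recent replies and replay them when a request is retransmitted."""

    def __init__(self, capacity: int = 1024) -> None:
        self.capacity = capacity
        self._replies: "OrderedDict[Hashable, object]" = OrderedDict()

    def handle(self, client: str, xid: int, execute: Callable[[], object]) -> object:
        key = (client, xid)
        if key in self._replies:
            return self._replies[key]          # retransmission: echo the earlier reply
        reply = execute()                      # first sighting: actually do the work
        self._replies[key] = reply
        if len(self._replies) > self.capacity:
            self._replies.popitem(last=False)  # forget the oldest remembered request
        return reply

# Toy "file server" with a non-idempotent create operation.
files: Set[str] = set()

def create(name: str) -> str:
    if name in files:
        return "EEXIST"
    files.add(name)
    return "OK"

drc = DuplicateRequestCache()
first = drc.handle("client-a", 7, lambda: create("report.txt"))
retry = drc.handle("client-a", 7, lambda: create("report.txt"))  # same xid, resent
fresh = drc.handle("client-a", 8, lambda: create("report.txt"))  # genuinely new call
print(first, retry, fresh)
```

The retransmitted call gets the original "OK" back instead of a spurious "file exists" error, while a genuinely new request for the same name still fails as it should.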

The second problem is figuring out what the appropriate timeout is. You want to keep the average cost of operations low, but not spend a lot of resources retransmitting packets. The latter could be expensive even if you don't have a crappy network--- say the server is slow because it's overloaded. You don't want to bombard it with extra traffic.

Say you're operating at 0.1% packet loss. A read (to disk) normally takes about 10ms when the system is not under load. If you set your timeout to 100ms, then the average read takes about 0.999 * 10ms + 0.000999 * 110ms + 0.000000999 * 210ms + so on, about 10.1ms. But if the timeout is a second, that becomes 11ms, and if the timeout's 10 seconds then we're talking 20ms.
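The series above has a closed form: if each attempt is lost independently with probability p, the expected number of timeouts you sit through is p / (1 - p). Here is a small sketch under that same simple model (fixed service time, fixed timeout, independent losses); the function names are mine.

```python
def expected_latency_ms(service_ms: float, timeout_ms: float, loss: float) -> float:
    """Closed form: service time plus the expected number of timeouts waited out."""
    return service_ms + timeout_ms * loss / (1.0 - loss)

def expected_latency_series_ms(service_ms: float, timeout_ms: float, loss: float,
                               terms: int = 50) -> float:
    """The same quantity summed term by term, as written out in the text."""
    total = 0.0
    for failures in range(terms):
        probability = (1.0 - loss) * loss ** failures
        total += probability * (service_ms + failures * timeout_ms)
    return total

for timeout in (100.0, 1_000.0, 10_000.0):
    mean = expected_latency_ms(10.0, timeout, 0.001)
    print(f"timeout {timeout:>7.0f} ms -> average read {mean:.1f} ms")
```

With a 10ms read and 0.1% loss this gives about 10.1ms, 11ms, and 20ms for the three timeouts.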

So, at least in theory, this is worth tuning because a 2x difference between a bad setting and a good setting is worthwhile. Except that the whole setup is completely nuts.

In order to make a reasonable decision, the system administrator needs statistics on how long various NFS calls tend to take, and the client captures this information. But once you've done that, why does the system administrator need to get involved? Why shouldn't the NFS client automatically adapt to the observed latency, and dynamically calculate a timeout value in keeping with a higher-level policy? (For a concrete example, the timeout could be set to the 99th percentile of observed response times, for an expected 1% over-retransmit rate.) Why on earth is it better to provide a tuning guide rather than develop a protocol implementation which doesn't need tuning? This fix wouldn't require any agreement from the server, you could just do it right on your own!
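Here is roughly what I have in mind, as a minimal sketch rather than anything a real NFS client does: keep a sliding window of observed response times and use a nearest-rank percentile as the timeout. The class name, the window size, and the starting timeout are all assumptions for illustration.

```python
from collections import deque
from typing import Deque

class AdaptiveTimeout:
    """Pick the retransmit timeout from recently observed response times."""

    def __init__(self, window: int = 1000, percent: int = 99,
                 initial_ms: float = 1000.0) -> None:
        self.percent = percent
        self.initial_ms = initial_ms
        self._samples: Deque[float] = deque(maxlen=window)  # oldest samples fall off

    def record(self, latency_ms: float) -> None:
        """Call this whenever a response arrives, with its measured latency."""
        self._samples.append(latency_ms)

    def timeout_ms(self) -> float:
        """Nearest-rank percentile of the window, or the initial value if empty."""
        if not self._samples:
            return self.initial_ms
        ordered = sorted(self._samples)
        rank = (self.percent * len(ordered) + 99) // 100  # ceil(percent * n / 100)
        return ordered[rank - 1]

policy = AdaptiveTimeout()
before = policy.timeout_ms()
for latency in range(1, 101):          # pretend we saw responses of 1..100 ms
    policy.record(float(latency))
after = policy.timeout_ms()
print(before, after)
```

A production version would want a floor, some backoff after repeated timeouts, and probably per-operation windows, but none of that needs a system administrator either.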

Fortunately, in the modern world NFS mainly runs over TCP, which has mechanisms which can usually tell more quickly that a request or response has gone missing. (Not always, and some of its specified timeouts suffer the same problem of being hard-coded rather than adaptive.) This doesn't remove the first problem (you still need the duplicate request cache) but makes an entire class of tuning pretty much obsolete.

Nothing upsets me more in systems research and development than a parameter that the user must set in order for the system to work correctly. If the correct value can be obtained observationally or experimentally, the system should do so. If the parameter must be set based on intuition, "workload type", expensive consultants, or the researcher's best guess then the real problem hasn't been solved yet.
Monday, June 29th, 2015
12:03 am
Keeping this short, but longer than Twitter
I really should not engage with my Twitter followers and co-workers who are engaging in gloom-and-doom about the status of Christianity in the United States. But really, guys, grow a spine. Christians experience far worse persecution in the world today than the "threat" of having to treat a married couple like a married couple. Christianity grew and flourished in an environment where the dominant religion was the Hellenistic pantheon. I think it can handle getting a little less of its way from the state for a while. (I can't wait to see what sort of lame-ass civil disobedience the local rabble-rousers think up.)

It bothers me that this is treated as a Christian issue rather than a some-Christian issue. Guess what, Christians were on both sides of previous marriage issues too.

That gets to the essence of religious liberty. If religious liberty means anything, it's that the state won't always agree with your sect. Even on those matters you consider a threat to that state! (I am reading "Amsterdam: A History of the World's Most Liberal City", by the way, so this is something on my mind what with Spinoza and all that.)

It also bothers me whenever Christians let obsession with sin get in the way of proclaiming the Good News that sin is forgiven. (Not recorded, not obsessed over, not punished by natural disaster, not hair-split by doctrine into tiers, not encoded into civil law, but washed away as if it had never been. Permitted, even, for the Greek word for "forgive" has that connotation as well!)

I have also noticed a small number of critiques of the gay marriage movement coming from the queer community. I can accept their argument that the focus on gay marriage may have attracted attention and support at the expense of other queer causes. I am a little less sanguine about excluding the cisgendered community from those seeking alternatives to marriage (a tradition that extends all the way back to naked Anabaptists parading through the streets of Amsterdam, if not further.)
Tuesday, June 9th, 2015
10:38 pm
MinneAnalytics trip report
I attended the MinneAnalytics "Big Data Tech" conference on Monday. I think my session choices were poor. What I wanted was success stories about using various technologies. What I got was mainly pitches in disguise (despite being forbidden by the conference organizer.)

Mary Poppendieck, "Cognitive Bias in a Cloud of Data". Mainly about System 1 and System 2 thinking. Talked about overcoming bias by keeping multiple options open and finding multiple opinions (dissenters.) Not very big-dataish, though she talked about how even big data requires some System 1 thinking (expertise) to design, analyze, and store data.

Dan McCreary (MarkLogic), "NoSQL and Cost Models". Some interesting points about how to get at the Total Cost of Ownership of your database. If a NoSQL database can scale to serve all your applications, that avoids the cost of ETL and duplicate data. Also talked about the need to agree on a standard format for data to avoid quadratic scaling costs as the number of applications increases. Made some remarks about the lower cost of parallel transforms.

Ravi Shanbhag (United Healthcare), "Apache Solr: Search is the new SQL". Basic intro. I found the need to define a schema sort of off-putting, and the idea of "dynamic fields" where you pattern-match on field names even worse. I think we can do better, as the next session showed.

Keys Botzum (MapR), "SQL for NoSQL", about Apache Drill. Pretty good slideware demo showing how to use Drill to run SQL queries across JSON files. Would have liked to see a multi-source demo as well but Drill can handle this. Considering whether to try it out with Tintri autosupports. Definitely the best talk of the day, speaker was enthusiastic and the technology is cool.

Frank Catrine and Mike Mulligan (Pentaho), "Internet of Things: Managing Unstructured Data." This talk made me less interested in the company than dropping by their booth. Tedious explanation of the Internet of Things and lots of generalities about the solution. Gave customer use cases that architecturally all looked the same but repeated the architecture slide anyway. (Surprise--- they all use Pentaho!)

(I never did write up an entry on this year's MinneBar.)
Friday, May 29th, 2015
12:23 am
Internet arguments about math I have participated in, because I'm on vacation
1. Is there such a thing as an uncomputable integer? Or, even if a function F(x) is uncomputable, does, say, F(10^100) still count as computable because, hey, some Turing machine out there somewhere could output it! Or does a number count as computable only if we (humans) can distinguish that machine from some other machine that also outputs a big number?

Consider that if F() is uncomputable, it seems extremely dubious that humans have some mechanism that lets them create Turing machines that produce arbitrarily-large values of F.

I'm probably wrong on the strict technical merits but those technical grounds are unduly Platonist for my taste.

2. Is trial division the Sieve of Eratosthenes? Maybe it's the same algorithm, just a particularly bad implementation!

I'm with R. A. Fisher on this one. If the idea was just some way to laboriously generate primes, it wouldn't be famous, even among Eratosthenes' contemporaries. The whole thing that makes the sieve notable is that you can apply arithmetic sequences (which translate into straight lines on a grid!) to cut down your work.

The counterargument seems to be that Eratosthenes might well have performed "skip, skip, skip, skip, mark" instead of jumping ahead by five.
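
One way to make the distinction concrete is to put the two approaches side by side. This is only a sketch of my reading of the argument, not a claim about what Eratosthenes actually did; the function names are made up for illustration. The sieve never divides: it walks an arithmetic sequence (p*p, p*p + p, ...) and crosses entries off. Trial division tests each candidate against the primes found so far.

```python
def sieve(limit):
    """Primes up to limit (limit >= 2) by crossing off arithmetic sequences."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            # Jump ahead by p each time; no division is performed.
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]


def trial_division(limit):
    """Primes up to limit by testing each candidate for divisibility."""
    primes = []
    for n in range(2, limit + 1):
        # Only primes up to sqrt(n) need to be tried.
        if all(n % p != 0 for p in primes if p * p <= n):
            primes.append(n)
    return primes
```

Both return the same list; the difference is in how the work is done, which is the whole point of the argument.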
Saturday, May 9th, 2015
7:02 pm
The term "Hyper-Converged" is the biggest scam ever
"Hyper-Converged" infrastructure solutions, like Nutanix, combine compute, storage, and a hypervisor.

"Converged" infrastructure, like a NetApp FlexPod, combines compute, storage, networking, and optionally some sort of application stack (like a hypervisor or a database.)

Which is more converged? The one with networking or the one without?

Maybe we should call the former "hypo-converged" if you need to buy your own switch.
Sunday, May 3rd, 2015
9:24 pm
Big Omaha
On SwC, the 5-card Omaha variants seem more popular than the 4-card standard version.

Suppose you're up against a flop of 688 and you hold something like 89TQK. How much of a favorite is the made baby boat 68xxx?

The surprising answer: barely at all. You have 12 outs twice, with a 42 card stub. You'll make the bigger boat 1 - 30/42 * 29/41 = about 0.495 of the time.
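
The arithmetic above is the usual "miss on both streets" calculation. Here is a small sketch of it using exact fractions; the helper name is hypothetical, and it assumes every one of the 42 unseen cards is equally likely to come on the turn and river.

```python
from fractions import Fraction


def hit_at_least_once(outs, unseen, draws=2):
    """Probability that at least one of `draws` cards is an out."""
    miss = Fraction(1)
    for i in range(draws):
        # Chance the (i+1)-th card also misses, given the earlier ones missed.
        miss *= Fraction(unseen - outs - i, unseen - i)
    return 1 - miss


p = hit_at_least_once(outs=12, unseen=42)
print(p, float(p))
```

For 12 outs and a 42-card stub this gives 142/287, which is the "about 0.495" figure.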

Of course, Villain might be able to make a low (if you're playing O/8) or hold some of your overcards himself, so your equity is not really 49%. But against something like 668JJ you might even be the favorite. That extra card really makes a difference.
Saturday, May 2nd, 2015
1:21 am
Never Trust Any Published Algorithm
The slides from my Minnebar talk, "Never Trust Any Published Algorithm", can be found here.

I got a couple good stories in response.

A local person who works in computer vision talked about finding a new algorithm for shadow detection. It worked great on the graduate student's ten examples. It didn't work at all on his company's hundreds of examples in all different lighting and weather conditions.

A coworker shared:

A few years ago my brother was working at an aircraft instrument company. He was
looking for an algorithm that described max loading for a plane still able to take off
based on many variables.

He found what was supposed to be the definitive solution written by a grad student at
USC. He implemented the algorithm and started testing it against a large DB of data
from actual airframe tests. He quickly found that the algorithm worked just fine for
sea-level examples, but not as airport altitude rose. He looked through the algorithm
and finally found where the math was wrong. He fixed his code to match his new
algorithm and found that it now matched the data he had for testing.

He sent the updated algorithm to the grad student so he could update it on public
servers. He never heard back nor did he ever see the public code updated.

The example I put as a bonus slide, which wasn't previously mentioned in my series of blog posts, is a good one too. In "Clustering of Time Series Subsequences is Meaningless" the authors showed that an entire big data technique that had been used dozens of times in published papers produced essentially random results. (The technique was to perform cluster analysis over sliding window slices of time series.)
1:05 am
Tintri brag
When we ask our customers "Would you recommend Tintri to a friend or colleague?" an astonishingly large number say yes.

One metric associated with this question is the Net Promoter score. You ask the above question on a scale of 0 to 10 (least to most likely). 9's and 10's count as positive. 7's and 8's are neutral. Anything 6 and below counts as negative. Take the percentage of positive responses, and subtract the percentage of negative responses.
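
As a sketch, the calculation described above looks like this. The function name and the sample ratings are made up for illustration; they are not Tintri survey data.

```python
def net_promoter_score(ratings):
    """Percent of promoters (9-10) minus percent of detractors (0-6)."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    # 7s and 8s are neutral: they count in the total but in neither group.
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical responses: three promoters, one neutral, one detractor.
print(net_promoter_score([10, 10, 9, 8, 5]))
```

With those five made-up responses the score is 60% minus 20%, or 40.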

Tintri scores a 94. Nearly every customer (who responds to surveys, anyway...) gives us a 9 or 10. Satmetrix Systems benchmarks Net Promoter scores by industry, and the leaders like USAA, Amazon, Apple, and Trader Joe's tend to have scores in the 70s and 80s.

It makes me incredibly happy that customers love our product so much. I almost feel like it's all downhill from here--- it'll be a big challenge as we grow to keep customer satisfaction that high. Maybe it's even a sign that we're not selling as much as we should be! But it's great confirmation of the quality of the product and of the Tintri team, all of whom I'm very proud of.
Friday, April 24th, 2015
6:08 pm
Infinitesimals in the family of asymptotics
I answered this question about the Master Theorem (which is used to calculate the asymptotic behavior of divide-and-conquer algorithms): Why is the recurrence: T (n) = sqrt(2) *T (n/2) + log(n) a case 1 of master method?
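
For reference, the recurrence is case 1 because a = sqrt(2) and b = 2 give n^(log_b a) = sqrt(n), and log(n) = O(n^(1/2 - ε)) for a small ε > 0, so T(n) = Θ(sqrt(n)). Here is a rough numerical sanity check. It assumes a base-2 logarithm, n restricted to powers of 2, and T(1) = 1, none of which come from the original question.

```python
import math


def T(k, base=1.0):
    """T(2**k) for T(n) = sqrt(2) * T(n/2) + log2(n), with T(1) = base."""
    t = base
    for j in range(1, k + 1):
        t = math.sqrt(2) * t + j  # log2(2**j) == j
    return t


def ratio(k):
    """T(n) / sqrt(n) at n = 2**k; bounded if T(n) is Theta(sqrt(n))."""
    return T(k) / 2 ** (k / 2)


for k in (10, 20, 40, 80):
    print(k, ratio(k))
```

Under these assumptions the ratio settles toward 5 + 3*sqrt(2), about 9.24, instead of growing, which is what Θ(sqrt(n)) predicts.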

The source of confusion (see the comments to my answer) seemed to be that the original poster did not really understand the asymptotic behavior of log(n) and n/log(n). In fact, he went so far as to post a graph "proving" that n^0.99 is larger than n/log(n). However, this is false for large enough numbers (large enough being somewhere between about 10^238 and 10^300 in this case, depending on the base of the logarithm). The logarithm grows more slowly than any positive power of n. As a consequence, n/log(n) grows asymptotically faster than any positive power of n less than 1.
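
No graph will show the crossover, but it is easy to find by working with logarithms of both sides. The sketch below assumes the natural logarithm (a different base moves the answer, though it stays astronomically large) and uses hypothetical helper names. It bisects on the exponent e in n = 10^e.

```python
import math


def log_gap(log10_n):
    """ln(n / ln n) - ln(n**0.99) for n = 10**log10_n."""
    ln_n = log10_n * math.log(10)
    return (ln_n - math.log(ln_n)) - 0.99 * ln_n


def crossover_exponent(lo=10.0, hi=1000.0):
    """Exponent e where n / ln n overtakes n**0.99, with n = 10**e."""
    # log_gap is negative at lo and positive at hi, with one sign change.
    for _ in range(100):
        mid = (lo + hi) / 2
        if log_gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi


print(crossover_exponent())
```

With the natural logarithm this lands a little above 281, that is, n around 10^281. Below that point n^0.99 really is the larger of the two, which is why the graph looked convincing.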

What I realized is that this might be some student's first (and maybe only!) experience with infinitesimals! Normally we think of infinitesimals as numbers "smaller than any positive real number". For example, before epsilon-delta proofs and modern analysis, the differential and integral calculus informally treated dx as an infinitesimal value. But while students are told the "idea" is that dx is a very small value, they are repeatedly cautioned not to treat it as an actual number.

The surreal numbers (and thus the class of combinatoric Games too) have infinitesimals, but they are far outside the normal curriculum.

So what model is a student supposed to apply when confronted with O(log n) or Θ(log n)? It behaves exactly like an infinitesimal in the family of asymptotic limits. big-O = bounded above, Ω = bounded below, and Θ = bounded both above and below, thus:

log n = Ω( 1 )

log n = O( n^c ) for any c > 0, i.e., log n = O(n), log n = O(sqrt(n)), log n = O(n^0.0001)

n^k log n = O( n^c ) for any c > k.

n / log n = Ω( n^c ) for any c < 1

Nor does taking a power of the logarithm make any difference.

log^100 n = (log n)^100 = O( n^c ) for any c > 0

Crazy, right? But we can get Wolfram Alpha to calculate the crossover point, say, when does (log n)^100 = n^0.01? At about 4.3*10^50669.
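
You can also get there without Wolfram Alpha. Taking logarithms of (log n)^100 = n^0.01 and writing x = ln n gives x = 10^4 * ln x, which a simple fixed-point iteration solves. This sketch assumes the natural logarithm; the function name and starting guess are my own choices.

```python
import math


def log_power_crossover(power=100.0, exponent=0.01, iterations=100):
    """Solve (ln n)**power == n**exponent; return (mantissa, power of ten)."""
    # There is also a trivial crossing just above n = e; starting from a
    # large guess makes the iteration converge to the large one.
    x = 1e6  # guess for ln(n)
    for _ in range(iterations):
        x = (power / exponent) * math.log(x)
    log10_n = x / math.log(10)
    exp10 = math.floor(log10_n)
    mantissa = 10 ** (log10_n - exp10)
    return mantissa, exp10


mantissa, exp10 = log_power_crossover()
print(f"{mantissa:.1f} * 10^{exp10}")
```

The iteration converges because the slope of 10^4 * ln x is well below 1 near the root.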

That's what makes the logarithm an infinitesimal. No matter what power we raise it to (think "multiply") it is still smaller in its asymptotic behavior than the smallest power of n (think "any positive real number.") And there's a whole family of infinitesimals hiding here. log log n is infinitesimally smaller than log n. Don't even ask about the inverse Ackermann function or the iterated-log function.

So it's not surprising that students might have difficulty if this is their first journey outside the real numbers. Everybody can handle the fact that O(n^2) and O(n^0.99) and O(n^0.5) are different, but the fact that none of these examples will be O(log n) is kind of perplexing, because O(log n) is obviously bigger than O(n^0). (Jumping to exponentials like O(2^n) seems like an easier stretch.)

What was your first encounter with an infinitesimal?