Matthew Hancock

Issues with HealthCare.gov Launch
I was interviewed for a Reuters article about issues affecting the performance of HealthCare.gov. The article includes two issues I encountered with the site. It started when the security questions failed to load on the sign-up form, which led me to notice the large number of resources being requested, all from the root www.healthcare.gov domain name.

The feedback on the article is about what I expected. People are saying, on Twitter for example, that 92 resources aren't that many.

Loading reuters.com, I saw 128 requests, but more importantly only 24 went to the main http://www.reuters.com domain. The 92 requests I mentioned all went to http://www.healthcare.gov, meaning every request was directed to the same servers or load balancers. And yes, this is entirely manageable. What concerns me is Development Seed's focus on the number of servers being used, described in this great piece by Alex Howard. They discuss walking into an expectation of 32 servers and getting that number down to 2. Obviously they would increase that number for the full launch, but too few servers and too many simultaneous requests will definitely cause issues. The site stayed up, and returning CSS or Javascript files only takes a millisecond or two (if done correctly) from the server's perspective, but each request, assuming no asynchrony was added, ties up a thread, and enough of them will exhaust a server's capacity to handle new requests.

Based on how communication with the backend system works (discussed next), there is potential for socket connection issues on both the client side and the server side when communicating with the backend system via web services. There is also talk of using Akamai as a CDN, so it is unclear whether healthcare.gov points to the CDN or whether certain paths are routed to the CDN as a sort of wrapper. Seeing as the backend system is reached via the intranet, it would be interesting to see the specific network structure and how the CDN is integrated.
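The per-domain comparison above is easy to reproduce for any page. This is a minimal sketch, assuming the request URLs have been copied out of the browser's network panel into a list. The sample URLs and the cdn.example.com host are made up for illustration and are not taken from either site.

```python
from collections import Counter
from urllib.parse import urlsplit


def requests_per_host(urls):
    """Count how many requested URLs point at each hostname."""
    return Counter(urlsplit(url).hostname for url in urls)


# Hypothetical sample; real input would be the full list from the network panel.
sample = [
    "https://www.healthcare.gov/",
    "https://www.healthcare.gov/js/a.js",
    "https://www.healthcare.gov/css/site.css",
    "https://cdn.example.com/js/lib.js",
]

for host, count in requests_per_host(sample).most_common():
    print(host, count)
```

A page whose count is concentrated on one hostname is sending all of that traffic to the same servers or load balancers.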

The main point is that best practices weren't followed. Some CSS and Javascript files weren't minified. The files certainly weren't bundled so they could be returned in a single response. The code was let out the door with 56 separate Javascript requests to the same domain instead of a dedicated domain for resources. The 56 Javascript files should have been combined into 1 minified file and served from a CDN with a separate domain name. If there was a laser-like focus on scalability and minimizing necessary capacity, that should have been the end result. This isn't the root cause of the site being brought down, but it added undue strain and calls into question the code review process followed before the site moved into production.
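To make the bundling step concrete, here is a minimal sketch of combining several script sources into one response body. It only concatenates; minification is left to a dedicated tool and is not attempted here. The function name and the sample scripts are hypothetical and have nothing to do with how the site was actually built.

```python
def bundle(sources):
    """Join script sources, in order, into a single string."""
    # The ";" between files guards against a file whose last statement
    # is unterminated and would otherwise run into the next file.
    return "\n;\n".join(source.strip() for source in sources) + "\n"


# Hypothetical script contents standing in for separate .js files.
scripts = ["var a = 1", "var b = a + 1;"]
print(bundle(scripts))
```

Served as one file from a separate domain, the result replaces many requests to the main domain with a single request elsewhere.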

The second issue, and I think the more important one in terms of bringing the site down, is how the backend system was accessed. I looked at requests using the Console in Firefox to determine why the security questions weren't loading. I noticed an Internal Server Error (status code 500) being returned for the URL: https://www.healthcare.gov/ee-rest/ffe/en_US/MyAccountEIDMUnsecuredIntegration/fetchAllSecurityQuestions/ffm. It returned an actual response about half the time. When it works, it returns JSON that contains the security questions. Why this is returned in a separate request and pieced together on the client side, I don't know. When an error occurs during a request, the full stacktrace is returned to the client within the JSON. This is another ignored best practice, as it reveals how the code works to unauthorized users and can potentially cause security issues.

javax.xml.ws.WebServiceException: Could not send Message.
at org.apache.cxf.jaxws.JaxWsClientProxy.invoke(JaxWsClientProxy.java:145)
at $Proxy3761.viewChallengeQuestions(Unknown Source)
at gov.hhs.cms.eidm.ws.client.eidmsystem.api.challengeqstns.ChallengeQuestions_ChallengeQuestionsService_Client.viewChallengeQuestions(ChallengeQuestions_ChallengeQuestionsService_Client.java:60)
at gov.hhs.cms.eidm.ws.client.eidmsystem.api.challengeqstns.ChallengeQuestions_ChallengeQuestionsService_Client.viewChallengeQuestions(ChallengeQuestions_ChallengeQuestionsService_Client.java:88)
at gov.hhs.cms.eidm.ws.proxy.service.impl.BaseEidmProxyServiceImpl.fetchSecurityQuestions(BaseEidmProxyServiceImpl.java:180)
at sun.reflect.GeneratedMethodAccessor708.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:173)
at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:89)
at org.apache.cxf.jaxws.JAXWSMethodInvoker.invoke(JAXWSMethodInvoker.java:61)
at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:75)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at org.apache.cxf.workqueue.SynchronousExecutor.execute(SynchronousExecutor.java:37)
at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:106)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:263)
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:123)
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:207)
at org.apache.cxf.transport.servlet.ServletController.invokeDestination(ServletController.java:213)
at org.apache.cxf.transport.servlet.ServletController.invoke(ServletController.java:193)
at org.apache.cxf.transport.servlet.CXFNonSpringServlet.invoke(CXFNonSpringServlet.java:126)
at org.apache.cxf.transport.servlet.AbstractHTTPServlet.handleRequest(AbstractHTTPServlet.java:185)
at org.apache.cxf.transport.servlet.AbstractHTTPServlet.doPost(AbstractHTTPServlet.java:108)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:637)
at org.apache.cxf.transport.servlet.AbstractHTTPServlet.service(AbstractHTTPServlet.java:164)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:183)
at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:95)
at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.jboss.web.tomcat.service.request.ActiveRequestResponseCacheValve.internalProcess(ActiveRequestResponseCacheValve.java:74)
at org.jboss.web.tomcat.service.request.ActiveRequestResponseCacheValve.invoke(ActiveRequestResponseCacheValve.java:47)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:436)
at org.apache.coyote.ajp.AjpProtocol$AjpConnectionHandler.process(AjpProtocol.java:385)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:451)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.SocketException: SocketException invoking http://10.153.199.69:8663/ChallengeQuestionsService*: Connection reset
at sun.reflect.GeneratedConstructorAccessor4142.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.mapException(HTTPConduit.java:1431)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.close(HTTPConduit.java:1416)
at org.apache.cxf.transport.AbstractConduit.close(AbstractConduit.java:56)
at org.apache.cxf.transport.http.HTTPConduit.close(HTTPConduit.java:649)
at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:62)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:263)
at org.apache.cxf.endpoint.ClientImpl.doInvoke(ClientImpl.java:533)
at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:463)
at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:366)
at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:319)
at org.apache.cxf.frontend.ClientProxy.invokeSync(ClientProxy.java:88)
at org.apache.cxf.jaxws.JaxWsClientProxy.invoke(JaxWsClientProxy.java:134)
... 49 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:168)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:695)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.handleResponseInternal(HTTPConduit.java:1542)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.handleResponse(HTTPConduit.java:1494)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.close(HTTPConduit.java:1402)
... 59 more


The stack trace shows a client-side resource containing JSON being populated by a web service located on the intranet. Note the line Caused by: java.net.SocketException: SocketException invoking http://10.153.199.69:8663/ChallengeQuestionsService, which contains an IP address in a range reserved for internal network use. It indicates that the server at that address (or the servers behind a load balancer) was overwhelmed, causing the frontend to stop working.
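For contrast with the response above, here is a minimal sketch of the usual alternative: log the full exception on the server and return only a generic message plus a reference id to the client. The function name, the logger name, and the JSON field names are all hypothetical. This is not how the site's code is written, only an illustration of the practice.

```python
import json
import logging
import uuid

logger = logging.getLogger("frontend")


def error_response(exc):
    """Log the full exception server-side; return a generic JSON body."""
    error_id = uuid.uuid4().hex
    # The stack trace and any internal addresses stay in the server log.
    logger.error("request failed (error id %s)", error_id, exc_info=exc)
    return json.dumps({"error": "Internal Server Error", "errorId": error_id})


try:
    # Simulates the backend failure seen in the trace.
    raise ConnectionResetError(
        "SocketException invoking http://10.153.199.69:8663/ChallengeQuestionsService"
    )
except ConnectionResetError as exc:
    body = error_response(exc)
    print(body)
```

The error id lets support staff find the matching log entry without the client ever seeing class names, line numbers, or the intranet address.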

My expectation is that these security questions aren't going to change much, if at all, so there is no reason for this web service to be called frequently. Seeing as this failed regularly and I wasn't the only one using the site, obviously these results weren't being cached. There was one other resource loaded in a similar way, at https://www.healthcare.gov/ee-rest/ffe/en_US/MyAccountEIDMUnsecuredIntegration/fetchEIDMValidations/ffm, which just loads regular expressions. Again, why this needs to be loaded in this way is unclear. The create account form is the only part people seemed to be able to get through, and these were the only requests that might have called the backend system (from what I can see as a frontend user). It seems crazy that this could bring the entire system down, but that might just be the case. With basic caching this would have been avoidable. Without caching, this application should never have been let out the door.
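By basic caching I mean something along these lines. This is a minimal sketch of a time-based cache in front of a slow call, assuming the data can safely be an hour stale. The class, the fetch_security_questions stand-in, and the one hour lifetime are all hypothetical, and the sketch is not thread-safe as written.

```python
import time


class TtlCache:
    """Cache one fetched value and reuse it until it is older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Reached only on the first call or after the value has expired.
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


calls = []


def fetch_security_questions():
    # Hypothetical stand-in for the backend web service call.
    calls.append(1)
    return ["Question A", "Question B"]


questions = TtlCache(fetch_security_questions, ttl_seconds=3600)
for _ in range(1000):
    questions.get()
print(len(calls))  # 1
```

A thousand page loads result in one backend call instead of a thousand, which is the difference between a web service that is barely touched and one that is hit on every sign-up.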

Finally, the overall architecture is confusing. When I think of backend systems, I think of batch processes or specific functions being run behind the scenes that aren't integral to or synchronous with the frontend system. Submitting a form that posts to a server, which in turn posts to another server, waits for a response from that server, and then returns its own response based on that one is an unnecessarily abstracted way for a high-performance application to run. It might not run that way, but that's how it looks based on the stacktrace I received and on the existing discussion of segregated frontend and backend systems.

All in all, the web site's failed launch is unfortunate. As a Democrat, I am disappointed in anything that makes it unnecessarily difficult for people to get access to health care. The errors that I've seen, and the many others that I haven't, disappoint me as a developer interested in well-designed code. Of the millions of visitors to the site, there are undoubtedly some who took time out of their day to go to the library or somewhere else with internet access, which they can't afford on their own, only to be unable to get through the site. I am also concerned for those who, as Alex Howard tweeted, with a proper launch would "have hours of their days back & wouldn't have lost trust in government to provide services online."

Conservative media is having a field day with the "Obamacare glitches" that have nothing to do with Obamacare or the government. These are private contractors that were unable to deliver. The main blame goes to CGI, which is implementing many state-level exchanges and has had issues with being behind schedule and over budget. Sharon Begley, who wrote the Reuters article, told me CGI was unavailable for comment. The article's scope seemed to be scaled back from our initial discussion, so it's unfortunate that more analysis of the players responsible for the site's failed launch isn't immediately available. I, and others, can only do so much analysis of the code from a web browser when most of the site's functionality is unavailable. So people can critique my analysis, although I'd disagree with it being "wrong."

As I tweeted last week:
In the absence of the shutdown, all of the focus would be on the site's errors so the GOP did Obamacare a huge favor. Despite the shutdown, at least some programmers are deemed critical enough to work on debugging the site. I hope when it re-launches after the weekend that the site is able to provide what the people of America deserve: easy access to affordable health care.

Ultimately, this should never have been contracted out. Those working in the IT department at DHHS would have skin in the game and would be better able to deliver. The downward pressure on government employees makes the government less effective, but that might be the point.