At my previous workplace, Pocket was our benchmark for the quality of the product we were building. It was not a Pocket alternative, but the core components were similar, i.e. we wanted to parse HTML and extract the relevant information from any given link. We were able to make some progress.
I have written components such as a resolver and a web crawler, and I can contribute that knowledge to this development. We learned that simple HTML parsing is not going to be enough. Although the W3C sets standards for HTML, the probability that any given site follows them is very low, which makes it harder to parse pages and locate the content. Moreover, not all pages are structured the same way. Writing parsing rules for one specific site like Wikipedia is easy, but structure differs from site to site, and there are millions of sites out there.
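To make the contrast with per-site rules concrete, here is a minimal sketch of a site-independent heuristic: split the page into text blocks and keep the block with the most text that is not inside links. This is only an illustration under my own assumptions, not the approach we built or what Pocket does; the names `BlockCollector`, `extract_main_text`, `SAMPLE`, and the tag sets are hypothetical. It uses only Python's standard `html.parser`, which tolerates unclosed tags and unquoted attributes.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as block boundaries for the heuristic.
BLOCK_TAGS = {"p", "div", "article", "section", "li", "td", "nav", "header", "footer"}
# Text inside these tags is never content.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects text blocks and counts how many characters sit inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (text, link_chars)
        self._parts = []       # text fragments of the block being built
        self._link_chars = 0   # characters of the current block inside <a>
        self._in_link = 0
        self._skip = 0

    def finish_block(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join(" ".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.finish_block()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.finish_block()

    def handle_data(self, data):
        if self._skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += len(" ".join(data.split()))


def extract_main_text(html):
    """Return the block with the most non-link text, or '' if there is none."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.finish_block()  # flush whatever trails the last tag
    best_text, best_score = "", 0
    for text, link_chars in collector.blocks:
        score = len(text) - link_chars  # non-link characters in the block
        if score > best_score:
            best_text, best_score = text, score
    return best_text


# Deliberately sloppy markup: unquoted attribute, unclosed <div> and <p>.
SAMPLE = """
<html><body>
<div id=nav><a href="/">Home</a> <a href="/about">About</a> <a href="/contact">Contact us today</a>
<div class="post"><p>Parsing arbitrary pages is hard because every site structures its markup differently.
<p>Short note.
<script>var x = "this script text should never be picked as content at all, ever";</script>
</body>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

A heuristic like this needs no per-site rules, but it is crude: it will miss articles split across many short paragraphs, which is part of why a learned or visual approach looks more promising.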
Now that I have defined the core problem, I think we should experiment with Portia first and proceed from there. Visual bots are the key here. We need to look into machine learning; NLP is less important, but we cannot ignore it completely.
I can contribute to the architecture design and to the code.
Phase 1 would focus on a prototype of the core module. Once we have a prototype, we can start tuning it and building other modules around it.